Author: Dr. Sheehan Olver
MATH50003 Numerical Analysis Lecture Notes: Asymptotics and Computational Cost
YouTube Lectures: Asymptotics and Computational Cost
We introduce Big-O, little-o and asymptotic notation and see how they can be used to describe computational cost.
Big-O, little-o, and "asymptotic to" are used to describe behaviour of functions at infinity.
Definition (Big-O) $f(x) = O(\phi(x))$ as $x \to \infty$ means $\left|{f(x) \over \phi(x)}\right|$ is bounded for sufficiently large $x$. That is, there exist constants $C$ and $x_0$ such that, for all $x \geq x_0$, $|f(x)| \leq C|\phi(x)|$.
Definition (little-o) $f(x) = o(\phi(x))$ as $x \to \infty$ means $\lim_{x \to \infty} {f(x) \over \phi(x)} = 0$.
Definition (asymptotic to) $f(x) \sim \phi(x)$ as $x \to \infty$ means $\lim_{x \to \infty} {f(x) \over \phi(x)} = 1$.
Example (asymptotics) We have $\cos x = O(1)$ as $x \to \infty$ since $|\cos x| \leq 1$ for all $x$; ${1 \over x} = o(1)$ as $x \to \infty$; and $x^2 + x \sim x^2$ as $x \to \infty$ since ${x^2 + x \over x^2} = 1 + {1 \over x} \to 1$.
Note we sometimes write $O(\phi(x))$ on its own to denote an unspecified function satisfying the Big-O bound, e.g. $x^2 + x = x^2 + O(x)$.
We have some simple algebraic rules:

Proposition (Big-O rules) If $f(x) = O(\phi(x))$ and $g(x) = O(\psi(x))$ as $x \to \infty$, then $f(x) + g(x) = O(|\phi(x)| + |\psi(x)|)$ and $f(x) g(x) = O(\phi(x) \psi(x))$ as $x \to \infty$.
We also have Big-O, little-o and "asymptotic to" at a point:
Definition (Big-O) $f(x) = O(\phi(x))$ as $x \to x_0$ means $\left|{f(x) \over \phi(x)}\right|$ is bounded in a neighbourhood of $x_0$. That is, there exist constants $C$ and $r$ such that, for all $0 < |x - x_0| \leq r$, $|f(x)| \leq C|\phi(x)|$.
Definition (little-o) $f(x) = o(\phi(x))$ as $x \to x_0$ means $\lim_{x \to x_0} {f(x) \over \phi(x)} = 0$.
Definition (asymptotic to) $f(x) \sim \phi(x)$ as $x \to x_0$ means $\lim_{x \to x_0} {f(x) \over \phi(x)} = 1$.
Example (asymptotics at a point) We have $\exp x = 1 + x + O(x^2)$ as $x \to 0$. Since, by Taylor's theorem, $\exp x = 1 + x + {x^2 \over 2} \exp t$ for some $t$ between $0$ and $x$, we have $\left|{\exp x - 1 - x \over x^2}\right| = {\exp t \over 2} \leq {{\rm e} \over 2}$ provided $|x| \leq 1$.
We will use Big-O notation to describe the computational cost of algorithms. Consider the following simple sum $\sum_{k=1}^n x_k^2$,
which we might implement as:
```julia
function sumsq(x)
    n = length(x)
    ret = 0.0
    for k = 1:n
        ret = ret + x[k]^2
    end
    ret
end
```
```julia
n = 100
x = randn(n)
sumsq(x)
```
Each step of this algorithm consists of one memory look-up (z = x[k]),
one multiplication (w = z*z) and one addition (ret = ret + w).
We will ignore the memory look-up in the following discussion.
The number of CPU operations per step is therefore 2 (the addition and multiplication).
Thus the total number of CPU operations is $2n = O(n)$.
Now consider a double sum like $\sum_{k=1}^n \sum_{j=1}^k x_j^2$,
which we might implement as:
```julia
function sumsq2(x)
    n = length(x)
    ret = 0.0
    for k = 1:n
        for j = 1:k
            ret = ret + x[j]^2
        end
    end
    ret
end
```
```julia
n = 100
x = randn(n)
sumsq2(x)
```
Now the inner loop takes $2k$ operations, hence the total number of CPU operations is $\sum_{k=1}^n 2k = n(n+1) = O(n^2)$ operations.
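As a sanity check on this operation count, we can instrument the double loop to count the operations directly (an illustrative sketch, not part of the original notes):

```julia
# Count the CPU operations performed by the double-loop sum:
# one squaring and one addition per inner-loop step.
function sumsq2_opcount(n)
    ops = 0
    for k = 1:n, j = 1:k
        ops += 2   # the squaring and the addition
    end
    ops
end

sumsq2_opcount(100)  # equals 100*101 = 10100, i.e. O(n^2)
```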
YouTube Lectures: Integers, Floating Point Numbers, Rounding, Bounding Rounding Errors
Reference: Overton
In this chapter, we introduce the Two's-complement
storage for integers and the
IEEE Standard for Floating-Point Arithmetic.
There are many possible ways of representing real numbers on a computer, and many possible choices for the precise behaviour of operations such as addition, multiplication, etc.
Before the 1980s each processor had potentially a different representation for
real numbers, as well as different behaviour for operations.
The IEEE standard, introduced in 1985, was a means to standardise this across processors so that algorithms would produce consistent and reliable results.
This chapter may seem very low level for a mathematics course but there are two important reasons to understand the behaviour of integers and floating-point numbers:
In this chapter we discuss the following:
Integers, including BigInt, which uses variable bit length storage, and floating-point numbers, including the variable precision BigFloat. Before we begin, we load two external packages. SetRounding.jl allows us to set the rounding mode of floating-point arithmetic. ColorBitstring.jl implements the functions printbits (and printlnbits) which print the bits (printlnbits adds a newline) of floating-point numbers in colour.
```julia
using SetRounding, ColorBitstring
```
Any integer can be represented in binary format, that is, as a sequence of 0s and 1s.
Definition For $B_0, B_1, \ldots, B_p \in \{0, 1\}$, denote a non-negative integer in binary format by:
$$(B_p \ldots B_1 B_0)_2 := 2^p B_p + \cdots + 2 B_1 + B_0.$$
For $b_1, b_2, \ldots \in \{0, 1\}$, denote a non-negative real number in binary format by:
$$(B_p \ldots B_0 . b_1 b_2 \ldots)_2 := (B_p \ldots B_0)_2 + {b_1 \over 2} + {b_2 \over 2^2} + \cdots.$$
First we show some examples of verifying a number's binary representation:
Example (integer in binary)
A simple integer example is $5 = 4 + 1 = 2^2 + 2^0 = (101)_2$.
Example (rational in binary)
Consider the number 1/3. In decimal recall that:
$$1/3 = 0.3333\ldots = {3 \over 10} + {3 \over 100} + {3 \over 1000} + \cdots.$$
We will see that in binary
$$1/3 = (0.010101\ldots)_2 = {1 \over 4} + {1 \over 16} + {1 \over 64} + \cdots.$$
Both results can be proven using the geometric series:
$$\sum_{k=0}^\infty z^k = {1 \over 1 - z}$$
provided $|z| < 1$. In decimal, ${3 \over 10} \sum_{k=0}^\infty 10^{-k} = {3/10 \over 1 - 1/10} = {1 \over 3}$. A similar argument with $z = 1/4$ gives ${1 \over 4} \sum_{k=0}^\infty 4^{-k} = {1/4 \over 1 - 1/4} = {1 \over 3}$.
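We can check this expansion numerically: partial sums of the binary series converge geometrically to 1/3 (an illustrative check in Float64, not part of the original notes):

```julia
# Partial sums of 1/3 = 1/4 + 1/16 + 1/64 + ⋯, i.e. sum of 4^(-k)
partialthird(n) = sum(4.0 .^ (-(1:n)))

partialthird(5), partialthird(30), 1/3
```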
On a computer one typically represents integers by a finite number of bits. Integers on a computer therefore follow modular arithmetic:
Definition (ring of integers modulo $m$) Denote the ring of integers modulo $m$ by $\mathbb{Z}_m := \{0, 1, \ldots, m-1\}$, with addition and multiplication defined modulo $m$.

Integers represented with $p$ bits on a computer actually represent elements of $\mathbb{Z}_{2^p}$, and integer arithmetic on a computer is equivalent to arithmetic modulo $2^p$.
Example (addition of 8-bit unsigned integers) Consider the addition of two 8-bit numbers:
$$255 + 1 = (11111111)_2 + (00000001)_2 = (100000000)_2 = 256.$$
The result is impossible to store in just 8 bits! It would be far too slow for a computer to increase the number of bits, or to throw an error (checks are slow). So instead it treats the integers as elements of $\mathbb{Z}_{2^8}$:
$$255 + 1 \equiv 0 \pmod{256}.$$
We can see this in Julia:
```julia
x = UInt8(255)
y = UInt8(1)
printbits(x); println(" + "); printbits(y); println(" = ")
printbits(x + y)
```
Example (multiplication of 8-bit unsigned integers) Multiplication works similarly: for example, $254 \times 2 = 508 \equiv 252 \pmod{256}$.
We can see this behaviour in code by printing the bits:
```julia
x = UInt8(254) # 254 represented in 8 bits as an unsigned integer
y = UInt8(2)   # 2 represented in 8 bits as an unsigned integer
printbits(x); println(" * "); printbits(y); println(" = ")
printbits(x * y)
```
Signed integers use the Two's complement convention. The convention is: if the first bit is 1 then the number is negative, and the bits $B_{p-1} \ldots B_0$ represent the number $(B_{p-1} \ldots B_0)_2 - 2^p$.
Example (converting bits to signed integers)
What 8-bit integer has the bits 01001001? Adding the corresponding decimal places we get:
```julia
2^0 + 2^3 + 2^6
```
What 8-bit (signed) integer has the bits 11001001? Because the first bit is 1 we know it's a negative number, hence we need to sum the bits but then subtract $2^p = 2^8$:
```julia
2^0 + 2^3 + 2^6 + 2^7 - 2^8
```
We can check the results using printbits:
```julia
printlnbits(Int8(73))
printbits(-Int8(55))
```
Arithmetic works precisely the same for signed and unsigned integers.
Example (addition of 8-bit integers)
Consider (-1) + 1 in 8-bit arithmetic. The number $-1$ is stored with the same bits as $255$, so the computation gives $0$, the correct result. In other words:
$$-1 + 1 \equiv 255 + 1 \equiv 0 \pmod{256}.$$
Example (multiplication of 8-bit integers)
Consider (-2) * 2. The number $-2$ is stored with the same bits as $254$, and $254 \times 2 \equiv 252 \pmod{256}$, whose bits represent the signed integer $252 - 256 = -4$, the correct result. In other words:
$$(-2) \times 2 \equiv 254 \times 2 \equiv 252 \equiv -4 \pmod{256}.$$
Example (overflow) We can find the largest and smallest instances of a type using typemax and typemin:
```julia
printlnbits(typemax(Int8)) # 2^7-1 = 127
printbits(typemin(Int8))   # -2^7 = -128
```
As explained, due to modular arithmetic, when we add 1 to the largest 8-bit integer we get the smallest:
```julia
typemax(Int8) + Int8(1) # returns typemin(Int8)
```
This behaviour is often not desired and is known as overflow, and one must be wary of using integers close to their largest value.
An alternative representation for integers uses a variable number of bits,
with the advantage of avoiding overflow but with the disadvantage of a substantial
speed penalty. In Julia these are BigInts, which we can create by calling big on an
integer:
```julia
x = typemax(Int64) + big(1) # too big to be an `Int64`
```
Note in this case addition automatically promotes an Int64 to a BigInt. We can create very large numbers using BigInt:
```julia
x^100
```
Note the number of bits is not fixed: the larger the number, the more bits required to represent it, so while overflow is impossible, it is possible to run out of memory if a number is astronomically large: go ahead and try x^x (at your own risk).
In addition to +, -, and * we have integer division ÷, which rounds towards zero:
```julia
5 ÷ 2 # equivalent to div(5,2)
```
Standard division / (or \ for division on the right) creates a floating-point number, which will be discussed shortly:
```julia
5 / 2 # alternatively 2 \ 5
```
We can also create rational numbers using //:
```julia
(1//2) + (3//4)
```
Rational arithmetic often leads to overflow, so it is often best to combine big with rationals:
```julia
big(102324)//132413023 + 23434545//4243061 + 23434545//42430534435
```
Floating-point numbers are a subset of real numbers that are representable using a fixed number of bits.
Definition (floating-point numbers) Given integers $σ$ (the "exponential shift"), $Q$ (the number of exponent bits) and $S$ (the precision), define the set of floating-point numbers by dividing into normal, sub-normal, and special number subsets:
$$F_{σ,Q,S} := F_{σ,Q,S}^{\rm normal} \cup F_{σ,Q,S}^{\rm sub\text{-}normal} \cup F^{\rm special}.$$
The normal numbers $F_{σ,Q,S}^{\rm normal} \subset \mathbb{R}$ are defined by
$$F_{σ,Q,S}^{\rm normal} := \{\pm 2^{q - σ} (1.b_1 b_2 \ldots b_S)_2 : 1 \leq q \leq 2^Q - 2\}.$$
The sub-normal numbers $F_{σ,Q,S}^{\rm sub\text{-}normal} \subset \mathbb{R}$ are defined as
$$F_{σ,Q,S}^{\rm sub\text{-}normal} := \{\pm 2^{1 - σ} (0.b_1 b_2 \ldots b_S)_2\}.$$
The special numbers $F^{\rm special}$ are defined later.
Note this set of real numbers has no nice algebraic structure: it is not closed under addition, subtraction, etc. We will therefore need to define approximate versions of algebraic operations later.
Floating-point numbers are stored in $1 + Q + S$ total bits, in the format
$$s\ q_Q \ldots q_1\ b_1 \ldots b_S.$$
The first bit ($s$) is the sign bit: 0 means positive and 1 means negative. The bits $q_Q \ldots q_1$ encode the exponent $q = (q_Q \ldots q_1)_2$. Finally, the bits $b_1 \ldots b_S$ encode the significand.

If $1 \leq q \leq 2^Q - 2$ (the exponent bits are not all 0 or all 1) the number is normal: $\pm 2^{q-σ}(1.b_1 \ldots b_S)_2$.

If $q = 0$ (the exponent bits are all 0) the number is sub-normal: $\pm 2^{1-σ}(0.b_1 \ldots b_S)_2$. If $q = 2^Q - 1$ (the exponent bits are all 1) the number is special, as discussed below.

Definition (IEEE floating-point numbers) IEEE has 3 standard floating-point formats: 16-bit (half precision), 32-bit (single precision) and 64-bit (double precision) defined by:
$$F_{16} := F_{15,5,10}, \qquad F_{32} := F_{127,8,23}, \qquad F_{64} := F_{1023,11,52}.$$
In Julia these correspond to 3 different floating-point types:

Float64 is a type representing double precision ($F_{64}$). We can create a Float64 by including a decimal point when writing the number: 1.0 is a Float64. Float64 is the default format for scientific computing (on the Floating-Point Unit, FPU).

Float32 is a type representing single precision ($F_{32}$). We can create a Float32 by including an f0 when writing the number: 1f0 is a Float32. Float32 is generally the default format for graphics (on the Graphics Processing Unit, GPU), as the difference between 32 bits and 64 bits is indistinguishable to the eye in visualisation, and more data can be fit into a GPU's limited memory.

Float16 is a type representing half precision ($F_{16}$).

Example (rational in Float32) How is the number $1/3$ stored in Float32?
Recall that
$$1/3 = (0.010101\ldots)_2 = 2^{-2}(1.0101\ldots)_2,$$
and since $q - 127 = -2$ we have $q = 125 = (1111101)_2$, so the exponent bits are 01111101.

For the significand we round the infinite expansion to the nearest element of $F_{32}$, and the significand bits are 01010101010101010101011.

Thus the Float32 bits for $1/3$ are:
```julia
printbits(1f0/3)
```
For sub-normal numbers, the simplest example is zero, which has $q = 0$ and all significand bits zero:
```julia
printbits(0.0)
```
Unlike integers, we also have a negative zero:
```julia
printbits(-0.0)
```
This is treated as identical to 0.0 (except for degenerate operations, as explained in special numbers).
When dealing with normal numbers there are some important constants that we will use to bound errors.
Definition (machine epsilon / smallest positive normal number / largest normal number) Machine epsilon is denoted
$$ε_{m,S} := 2^{-S}.$$
When $S$ is implied by context we use the notation $ε_m$. The smallest positive normal number has $q = 1$ and all significand bits zero:
$$\min |F_{σ,Q,S}^{\rm normal}| = 2^{1 - σ},$$
where $|F| := \{|x| : x \in F\}$. The largest (positive) normal number has $q = 2^Q - 2$ and all significand bits one:
$$\max F_{σ,Q,S}^{\rm normal} = 2^{2^Q - 2 - σ}(2 - ε_m).$$
We confirm the simple bit representations:
```julia
σ,Q,S = 127,8,23 # Float32
εₘ = 2.0^(-S)
printlnbits(Float32(2.0^(1-σ)))               # smallest positive normal Float32
printlnbits(Float32(2.0^(2^Q-2-σ) * (2-εₘ)))  # largest Float32
```
For a given floating-point type, we can find these constants using the following functions:
```julia
eps(Float32), floatmin(Float32), floatmax(Float32)
```
Example (creating a sub-normal number) If we divide the smallest normal number by two, we get a sub-normal number:
```julia
mn = floatmin(Float32) # smallest normal Float32
printlnbits(mn)
printbits(mn/2)
```
Can you explain the bits?
The special numbers extend the real line by adding $\pm\infty$ as well as NaN ("not a number").
Definition (not a number) Let NaN represent "not a number", an element on which arithmetic operations are undefined (or rather, defined to return NaN).

Whenever the exponent bits are all 1 and the significand bits are all 0, the bits represent $\pm\infty$. These are called Inf and -Inf for 64-bit floating-point numbers (or Inf16, Inf32 for 16-bit and 32-bit, respectively):
```julia
printlnbits(Inf16)
printbits(-Inf16)
```
All other special floating-point numbers (exponent bits all 1 with at least one nonzero significand bit) represent NaN. One particular representation is called NaN for 64-bit floating-point numbers (or NaN16, NaN32 for 16-bit and 32-bit, respectively):
```julia
printbits(NaN16)
```
These are needed for undefined algebraic operations such as:
```julia
0/0
```
Example (many NaNs) What happens if we change some other significand bits, while keeping the exponent bits all 1?
```julia
i = parse(UInt16, "0111110000010001"; base=2)
reinterpret(Float16, i)
```
Thus, there is more than one NaN on a computer.
Arithmetic operations on floating-point numbers are exact up to rounding. There are three basic rounding strategies: round up/down/nearest. Mathematically we introduce a function to capture the notion of rounding:
Definition (rounding)
${\rm fl}^{\rm up}(x)$ denotes the function that rounds a real number up to the nearest floating-point number that is greater than or equal to $x$. ${\rm fl}^{\rm down}(x)$ denotes the function that rounds a real number down to the nearest floating-point number that is less than or equal to $x$. ${\rm fl}^{\rm nearest}(x)$ denotes the function that rounds a real number to the nearest floating-point number; in case of a tie, it returns the floating-point number whose least significant significand bit is equal to zero. We use the notation ${\rm fl}(x)$ when the floating-point type and the rounding mode are implied by context, with ${\rm fl}^{\rm nearest}$ being the default rounding mode.
In Julia, the rounding mode is specified by tags RoundUp, RoundDown, and
RoundNearest. (There are also more exotic rounding strategies RoundToZero, RoundNearestTiesAway and
RoundNearestTiesUp that we won't use.)
WARNING (rounding performance, advanced) These rounding modes are part
of the FPU instruction set so will be (roughly) equally fast as the default, RoundNearest.
Unfortunately, changing the rounding mode is expensive, and is not thread-safe.
Let's try rounding a Float64 to a Float32.
```julia
printlnbits(1/3)        # 64 bits
printbits(Float32(1/3)) # round to nearest 32-bit
```
The default rounding mode can be changed:
```julia
printbits(Float32(1/3, RoundDown))
```
Or alternatively we can change the rounding mode for a chunk of code using setrounding. The following computes upper and lower bounds for division:
```julia
x = 1f0
setrounding(Float32, RoundDown) do
    x/3
end,
setrounding(Float32, RoundUp) do
    x/3
end
```
WARNING (compiled constants, advanced): Why did we first create a variable x instead of typing 1f0/3? This is due to a very subtle issue where the compiler is too clever for its own good: it recognises that 1f0/3 can be computed at compile time, but fails to recognise that the rounding mode was changed.
In IEEE arithmetic, the arithmetic operations +, -, *, / are defined by the property that they are exact up to rounding. Mathematically we denote these operations as follows:
$$x \oplus y := {\rm fl}(x + y), \quad x \ominus y := {\rm fl}(x - y), \quad x \otimes y := {\rm fl}(x \times y), \quad x \oslash y := {\rm fl}(x / y).$$
Note also that ^ and sqrt are similarly exact up to rounding.
Example (decimal is not exact) 1.1 + 0.1 gives a different result than 1.2:
```julia
x = 1.1
y = 0.1
x + y - 1.2 # not zero?!?
```
This is because none of 1.1, 0.1 and 1.2 are exactly representable in binary, so each is already rounded, and the rounded sum of the first two differs from the rounding of 1.2.
WARNING (non-associative) These operations are not associative! E.g. $(x \oplus y) \oplus z$ is not necessarily equal to $x \oplus (y \oplus z)$:
```julia
(1.1 + 1.2) + 1.3, 1.1 + (1.2 + 1.3)
```
Can you explain this in terms of bits?
Before we discuss bounds on errors, we need to talk about the two notions of errors:

Definition (absolute/relative error) If $\tilde x = x + δ_{\rm a} = x(1 + δ_{\rm r})$, then $|δ_{\rm a}|$ is called the absolute error and $|δ_{\rm r}| = \left|{δ_{\rm a} \over x}\right|$ is called the relative error in approximating $x$ by $\tilde x$.
We can bound the error of basic arithmetic operations in terms of machine epsilon, provided a real number is close to a normal number:
Definition (normalised range) The normalised range $\mathcal{N}_{σ,Q,S} \subset \mathbb{R}$ is the subset of real numbers that lies between the smallest and largest normal floating-point numbers in magnitude:
$$\mathcal{N}_{σ,Q,S} := \{x \in \mathbb{R} : \min |F_{σ,Q,S}^{\rm normal}| \leq |x| \leq \max F_{σ,Q,S}^{\rm normal}\}.$$
When $σ, Q, S$ are implied by context we use the notation $\mathcal{N}$.
We can use machine epsilon to determine bounds on rounding:
Proposition (rounding arithmetic) If $x \in \mathcal{N}$ then
$${\rm fl}(x) = x(1 + δ)$$
where the relative error satisfies $|δ| \leq {ε_m \over 2}$ for nearest rounding (and $|δ| \leq ε_m$ for up/down rounding).

This immediately implies relative error bounds on all IEEE arithmetic operations, e.g., if $x + y \in \mathcal{N}$ then
$$x \oplus y = (x + y)(1 + δ)$$
where (assuming the default nearest rounding) $|δ| \leq {ε_m \over 2}$.
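We can check the rounding bound numerically, treating a Float64 value as the "exact" $x$ and rounding it to Float32 (an illustrative sketch, using eps(Float32) for $ε_m$):

```julia
# For x in the normalised range, fl(x) = x(1+δ) with |δ| ≤ εₘ/2 under
# round-to-nearest. Verify on a few sample values.
εₘ = eps(Float32)  # machine epsilon 2^(-23) for Float32
maxδ = maximum(abs((Float64(Float32(x)) - x) / x) for x in (1/3, 0.1, 2/7, 123.456))
maxδ ≤ εₘ/2  # the bound holds
```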
Example (bounding a simple computation) We show how to bound the error in computing $(1.1 + 1.2) + 1.3$ using floating-point arithmetic. First note that 1.1 on a computer is in fact ${\rm fl}(1.1) = 1.1(1 + δ_1)$ where $|δ_1| \leq ε_m/2$, and similarly for 1.2 and 1.3. Thus the computation becomes
$$({\rm fl}(1.1) \oplus {\rm fl}(1.2)) \oplus {\rm fl}(1.3) = \left((1.1(1+δ_1) + 1.2(1+δ_2))(1+δ_3) + 1.3(1+δ_4)\right)(1+δ_5) = 3.6 + δ,$$
where each $|δ_k| \leq ε_m/2$ (note $ε_m = 2^{-52}$ for Float64). Ignoring terms quadratic in the $δ_k$, the absolute error satisfies
$$|δ| \leq {ε_m \over 2}\left(3 \times 1.1 + 3 \times 1.2 + 2 \times 1.3\right) = 4.75\, ε_m \leq 4.8\, ε_m.$$
Indeed, this bound is bigger than the observed error:
```julia
abs(3.6 - (1.1+1.2+1.3)), 4.8eps()
```
Arithmetic works differently on Inf and NaN and for undefined operations. In particular we have:
```julia
1/0.0        # Inf
1/(-0.0)     # -Inf
0.0/0.0      # NaN

Inf*0        # NaN
Inf+5        # Inf
(-1)*Inf     # -Inf
1/Inf        # 0.0
1/(-Inf)     # -0.0
Inf - Inf    # NaN
Inf == Inf   # true
Inf == -Inf  # false

NaN*0        # NaN
NaN+5        # NaN
1/NaN        # NaN
NaN == NaN   # false
NaN != NaN   # true
```
Other special functions like cos, sin, exp, etc. are not part of the IEEE standard.
Instead, they are implemented by composing the basic arithmetic operations, which accumulate
errors. Fortunately many are designed to have relative accuracy, that is, s = sin(x) (that is, the Julia implementation of $\sin x$) satisfies
$$s = \sin(x)(1 + δ)$$
where $|δ|$ is a small multiple of $ε_m$ — but note this is relative to the floating-point input $x$, not the exact mathematical value:
```julia
π₆₄ = Float64(π)
πᵦ = big(π₆₄) # convert the 64-bit approximation of π to higher precision. Note it's the same number.
abs(sin(π₆₄)), abs(sin(π₆₄) - sin(πᵦ)) # only has relative accuracy compared to sin(πᵦ), not sin(π)
```
Another issue is when $x$ is large:
```julia
ε = eps() # machine epsilon, 2^(-52)
x = 2*10.0^100
abs(sin(x) - sin(big(x))) ≤ abs(sin(big(x))) * ε
```
But if we instead compute 10^100 using BigFloat we get a completely different answer that even has the wrong sign!
```julia
x̃ = 2*big(10.0)^100
sin(x), sin(x̃)
```
This is because we commit an error on the order of roughly $2 \times 10^{100} ε_m \approx 4 \times 10^{84}$ when we round $2 \times 10^{100}$ to the nearest Float64, which completely changes the value of its sine.
Example (polynomial near root)
For general functions we do not generally have relative accuracy. For example, consider a simple polynomial $f(x) = 1 + 4x + x^2$, which has a root at $x = \sqrt 3 - 2$:
```julia
f = x -> 1 + 4x + x^2
x = sqrt(3) - 2
abserr = abs(f(big(x)) - f(x))
relerr = abserr/abs(f(x))
abserr, relerr # very large relative error
```
We can see this in the error bound (note that $4x$ is computed exactly, since multiplying by 4 only changes the exponent): using a simple bound, the absolute error in evaluating f(x) is on the order of $ε_m$, but since x lies within round-off of a root, the true value of f(x) itself is also tiny, so the relative error is enormous.
It is possible to set the precision of a floating-point number using the BigFloat type, which results from the usage of big when the result is not an integer. For example, here is an approximation of 1/3 accurate to 77 decimal digits:
```julia
big(1)/3
```
Note we can set the rounding mode as in Float64, e.g., this gives (rigorous) bounds on 1/3:
```julia
setrounding(BigFloat, RoundDown) do
    big(1)/3
end,
setrounding(BigFloat, RoundUp) do
    big(1)/3
end
```
We can also increase the precision, e.g., this finds bounds on 1/3 accurate to more than 1000 decimal places:
```julia
setprecision(4_000) do # 4000-bit precision
    setrounding(BigFloat, RoundDown) do
        big(1)/3
    end,
    setrounding(BigFloat, RoundUp) do
        big(1)/3
    end
end
```
In the problem sheet we shall see how this can be used to rigorously bound errors in computations.
YouTube Lectures: Finite Differences Dual Numbers
We now get to our first computational problem: given a function, how can we approximate its derivative at a point? Before we begin, we must be clear what a "function" is. Consider three possible scenarios:
A "black-box" function (e.g. one that takes a Float64 and returns another Float64) which we only know pointwise. This is the situation if we have a function that relies on a compiled C library, which composes floating-point arithmetic operations.

We discuss the following techniques: finite differences and dual numbers (automatic differentiation).
Note there are other techniques for differentiation that we don't discuss.
```julia
using ColorBitstring
```
The definition of the derivative
$$f'(x) = \lim_{h \to 0} {f(x+h) - f(x) \over h}$$
tells us that
$$f'(x) \approx {f(x+h) - f(x) \over h}$$
provided that $h$ is sufficiently small.
It's important to note that approximation uses only the black-box notion of a function but to obtain bounds we need more.
If we know a bound on $f''$ then we can bound the error in this approximation:

Proposition The error in approximating the derivative using finite differences is
$${f(x+h) - f(x) \over h} = f'(x) + δ, \qquad |δ| \leq {M \over 2} h,$$
where $M = \sup_{x \leq t \leq x+h} |f''(t)|$.

Proof Follows immediately from Taylor's theorem:
$$f(x+h) = f(x) + f'(x) h + {f''(t) \over 2} h^2$$
for some $t \in [x, x+h]$.

◼️
There are also alternative versions of finite differences. Left-sided finite differences:
$$f'(x) \approx {f(x) - f(x-h) \over h}$$
and central differences:
$$f'(x) \approx {f(x+h) - f(x-h) \over 2h}.$$
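The central difference is more accurate in exact arithmetic: its error is $O(h^2)$ rather than $O(h)$. A quick illustrative comparison for $\exp$ at $x = 0$, where the true derivative is 1 (using a moderate $h$ so round-off is negligible; not from the original notes):

```julia
h = 1e-4
rightsided = (exp(h) - exp(0)) / h      # error ≈ h/2
central    = (exp(h) - exp(-h)) / (2h)  # error ≈ h^2/6
abs(rightsided - 1), abs(central - 1)
```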
Composing these approximations is useful for higher-order derivatives as we discuss in the problem sheet.
Note this is assuming real arithmetic, the answer is drastically different with floating point arithmetic.
Let's try differentiating two simple polynomials:
```julia
f = x -> 1 + x + x^2 # we treat f and g as black-boxes
g = x -> 1 + x/3 + x^2
h = 0.000001
(f(h)-f(0))/h, (g(h)-g(0))/h
```
Both seem to roughly approximate the true derivatives ($f'(0) = 1$ and $g'(0) = 1/3$). We can investigate more systematically by plotting the error as $h$ becomes small:
```julia
using Plots
h = 2.0 .^ (0:-1:-60) # [1,1/2,1/4,…]
nanabs = x -> iszero(x) ? NaN : abs(x) # avoid 0's in log scale plot
plot(nanabs.((f.(h) .- f(0)) ./ h .- 1); yscale=:log10,
     title="convergence of derivatives, h = 2^(-n)", label="f", legend=:bottomleft)
plot!(abs.((g.(h) .- g(0)) ./ h .- 1/3); yscale=:log10, label = "g")
```
In the case of $f$ the error decreases until the approximation becomes exact, while for $g$ the error decreases and then starts to grow again. It is clear that the behaviour also depends on the choice of $h$:
```julia
h = 10.0 .^ (0:-1:-16) # [1,1/10,1/100,…]
plot(abs.((f.(h) .- f(0)) ./ h .- 1); yscale=:log10,
     title="convergence of derivatives, h = 10^(-n)", label="f", legend=:bottomleft)
plot!(abs.((g.(h) .- g(0)) ./ h .- 1/3); yscale=:log10, label = "g")
```
For these two simple examples, we can understand why we see very different behaviour.
Example (convergence(?) of finite differences) Consider differentiating $f(x) = 1 + x + x^2$ at 0 using Float16 ($S = 10$) with $h = 2^{-n}$. Note that $f(h) = 1 + h + h^2$ and the true derivative is $f'(0) = 1$.

Case 1 ($n \leq S/2$): Both $h = 2^{-n}$ and $h^2 = 2^{-2n}$ are representable exactly alongside 1, so $f(h)$ is computed exactly:
```julia
S = 10 # 10 significant bits
n = 3  # 3 ≤ S/2 = 5
h = Float16(2)^(-n)
printbits(f(h))
```
Subtracting 1 and dividing by $h$ gives exactly $1 + h$, so the error is $h = 2^{-n}$, which shows exponential convergence (in $n$).

Case 2 ($S/2 < n \leq S$): Then $h^2 = 2^{-2n}$ falls below half the last significand bit of $1 + h$ and is rounded away, so the computed divided difference is exactly 1, the true derivative. We have actually performed better than true real arithmetic and converged without a limit!

Case 3 ($n > S$): Then $1 \oplus h = 1$, so the computed divided difference is 0 and the approximation is useless.
Example (divergence of finite differences) Consider differentiating $g(x) = 1 + x/3 + x^2$ at 0, where now the term $h/3$ is not exactly representable. Note we lose two bits each time in the computation of $1 + h/3$:
```julia
n = 0; h = Float16(2)^(-n); printlnbits(1 + h/3)
n = 2; h = Float16(2)^(-n); printlnbits(1 + h/3)
n = 4; h = Float16(2)^(-n); printlnbits(1 + h/3)
n = 8; h = Float16(2)^(-n); printlnbits(1 + h/3)
```
It follows that with $h = 2^{-n}$ the absolute error in computing $1 \oplus (h \oslash 3)$ is on the order of $2^{-S-1}$, and does not decrease with $h$. Therefore dividing by $h$ amplifies this error to roughly $2^{n-S-1}$. Thus the error grows exponentially with $n$. If $n$ is very large the divided difference returns 0, and the error is simply the size of the true derivative, $1/3$.
We can bound the error using the bounds on floating point arithmetic.
Theorem (finite-difference error bound) Let $f$ be twice-differentiable in a neighbourhood of $x$ and assume that its floating-point implementation $f^{\rm FP}$ has uniform absolute accuracy in that neighbourhood, that is:
$$f^{\rm FP}(x) = f(x) + δ_x^f, \qquad |δ_x^f| \leq c\, ε_m,$$
for a fixed constant $c$. Assume for simplicity $h = 2^{-n}$, where $n \leq S$, and $|x| \leq 1$. Then the finite-difference approximation satisfies
$${f^{\rm FP}(x + h) \ominus f^{\rm FP}(x) \over h} = f'(x) + δ_{x,h}^{\rm FD},$$
where
$$|δ_{x,h}^{\rm FD}| \leq {|f'(x)| \over 2} ε_m + M h + {4 c\, ε_m \over h}$$
for $M = \sup_{x \leq t \leq x+h} |f''(t)|$.

Proof
We have (noting by our assumptions that $x + h$ is computed exactly, and that dividing by $h = 2^{-n}$ only changes the exponent so is also exact):
$$(f^{\rm FP}(x + h) \ominus f^{\rm FP}(x)) / h = {\left(f(x+h) + δ_{x+h}^f - f(x) - δ_x^f\right)(1 + δ_1) \over h}$$
where $|δ_1| \leq ε_m/2$. Applying Taylor's theorem, $f(x+h) - f(x) = f'(x) h + {f''(t) \over 2} h^2$ for some $t \in [x, x+h]$. The bound then follows, using the very pessimistic bound $1 + δ_1 \leq 2$ on the remainder terms.
The three terms of this bound tell us a story: the first term is a fixed (small) error, the second term tends to zero as $h \to 0$, while the third term grows like $ε_m / h$ as $h \to 0$. Thus we expect the error to be minimised when the second and third terms balance, that is, when $h \approx \sqrt{ε_m}$. This gives a heuristic:

Heuristic (finite-difference with floating-point step) Choose $h$ proportional to $\sqrt{ε_m}$ in the divided-difference approximation of the derivative.

In the case of double precision $\sqrt{ε_m} \approx 1.5 \times 10^{-8}$.
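A quick illustrative check of the heuristic (not from the original notes), differentiating $\exp$ at 0 where the true derivative is 1:

```julia
h = sqrt(eps())               # ≈ 1.49e-8 in double precision
deriv = (exp(h) - exp(0)) / h # divided-difference approximation of exp'(0) = 1
abs(deriv - 1)                # error roughly of size √εₘ, i.e. ~8 digits of accuracy
```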
Remark: While finite differences is of debatable utility for computing derivatives, it is extremely effective in building methods for solving differential equations, as we shall see later. It is also very useful as a "sanity check" if one wants something to compare with for other numerical methods for differentiation.
Automatic differentiation consists of applying functions to special types that determine the derivatives. Here we do so via dual numbers.

Definition (dual numbers) Dual numbers $\mathbb{D}$ are a commutative ring over the reals generated by $1$ and $ϵ$ such that $ϵ^2 = 0$. Dual numbers are typically written as $a + bϵ$ where $a$ and $b$ are real.

This is very much analogous to complex numbers, which are a field generated by $1$ and ${\rm i}$ such that ${\rm i}^2 = -1$, with elements written as $a + b{\rm i}$. And just as we view $\mathbb{R} \subset \mathbb{C}$, we can view $\mathbb{R} \subset \mathbb{D}$.
Applying a polynomial to a dual number $a + bϵ$ reveals the derivative at $a$:

Theorem (polynomials on dual numbers) Suppose $p$ is a polynomial. Then
$$p(a + bϵ) = p(a) + b p'(a) ϵ.$$

Proof
It suffices to consider $p(x) = x^n$ for $n \geq 1$, by linearity. By the binomial theorem, since $ϵ^k = 0$ for $k \geq 2$:
$$(a + bϵ)^n = \sum_{k=0}^n {n \choose k} a^{n-k} (bϵ)^k = a^n + n a^{n-1} b ϵ.$$
◼️
We can extend real-valued differentiable functions to dual numbers in a similar manner. First, consider a standard function with a Taylor series (e.g. $\exp$, $\sin$, $\cos$):
$$f(x) = \sum_{k=0}^\infty f_k x^k,$$
so that formally, using $ϵ^2 = 0$,
$$f(a + bϵ) = \sum_{k=0}^\infty f_k (a + bϵ)^k = f(a) + b f'(a) ϵ.$$
More generally, given a differentiable function we can extend it to dual numbers:

Definition (dual extension) Suppose a real-valued function $f$ is differentiable at $a$. If
$$f(a + bϵ) = f(a) + b f'(a) ϵ$$
then we say that it is a dual extension at $a$.

Thus, for basic functions we have natural extensions:
$$\exp(a + bϵ) := \exp(a) + b \exp(a) ϵ, \qquad \sin(a + bϵ) := \sin(a) + b \cos(a) ϵ,$$
$$\cos(a + bϵ) := \cos(a) - b \sin(a) ϵ, \qquad \log(a + bϵ) := \log(a) + {b \over a} ϵ,$$
provided the function is differentiable at $a$.
Going further, we can add, multiply, and compose such functions:

Lemma (product and chain rule) If $f$ is a dual extension at $g(a)$ and $g$ is a dual extension at $a$, then $f \circ g$ is a dual extension at $a$. If $f$ and $g$ are dual extensions at $a$, then $fg$ is also a dual extension at $a$. In other words:
$$f(g(a + bϵ)) = f(g(a)) + b g'(a) f'(g(a)) ϵ \quad\hbox{and}\quad (fg)(a + bϵ) = f(a)g(a) + b(f'(a)g(a) + f(a)g'(a)) ϵ.$$

Proof
For composition:
$$f(g(a + bϵ)) = f(g(a) + b g'(a) ϵ) = f(g(a)) + b g'(a) f'(g(a)) ϵ.$$
For the product, using $ϵ^2 = 0$:
$$f(a + bϵ) g(a + bϵ) = (f(a) + b f'(a) ϵ)(g(a) + b g'(a) ϵ) = f(a) g(a) + b (f'(a) g(a) + f(a) g'(a)) ϵ.$$
◼️
A simple corollary is that any function defined in terms of addition, multiplication, composition, etc. of functions that are dual with differentiation will be differentiable via dual numbers.
Example (differentiating a non-polynomial)
Consider $f(x) = \exp(x^2 + {\rm e}^x)$, a sum and composition of dual-extendable functions. Applying the rules above:
$$f(a + bϵ) = \exp(a^2 + {\rm e}^a) + b (2a + {\rm e}^a) \exp(a^2 + {\rm e}^a) ϵ,$$
and therefore we deduce that
$$f'(a) = (2a + {\rm e}^a) \exp(a^2 + {\rm e}^a).$$
We now consider a simple implementation of dual numbers that works on general polynomials:
```julia
# Dual(a,b) represents a + b*ϵ
struct Dual{T}
    a::T
    b::T
end

# Dual(a) represents a + 0*ϵ
Dual(a::Real) = Dual(a, zero(a)) # for real numbers we use a + 0ϵ

# Allow for a + b*ϵ syntax
const ϵ = Dual(0, 1)

import Base: +, *, -, /, ^, zero, exp

# support polynomials like 1 + x, x - 1, 2x or x*2 by reducing to Dual
+(x::Real, y::Dual) = Dual(x) + y
+(x::Dual, y::Real) = x + Dual(y)
-(x::Real, y::Dual) = Dual(x) - y
-(x::Dual, y::Real) = x - Dual(y)
*(x::Real, y::Dual) = Dual(x) * y
*(x::Dual, y::Real) = x * Dual(y)

# support x/2 (but not yet division of duals)
/(x::Dual, k::Real) = Dual(x.a/k, x.b/k)

# a simple recursive function to support x^2, x^3, etc.
function ^(x::Dual, k::Integer)
    if k < 0
        error("Not implemented")
    elseif k == 1
        x
    else
        x^(k-1) * x
    end
end

# Algebraic operations for duals
-(x::Dual) = Dual(-x.a, -x.b)
+(x::Dual, y::Dual) = Dual(x.a + y.a, x.b + y.b)
-(x::Dual, y::Dual) = Dual(x.a - y.a, x.b - y.b)
*(x::Dual, y::Dual) = Dual(x.a*y.a, x.a*y.b + x.b*y.a)

exp(x::Dual) = Dual(exp(x.a), exp(x.a) * x.b)
```
We can also try it on the two polynomials as above:
```julia
f = x -> 1 + x + x^2
g = x -> 1 + x/3 + x^2
f(ϵ).b, g(ϵ).b
```
The first example exactly computes the derivative, and the second example is exact up to the last bit rounding! It also works for higher order polynomials:
```julia
f = x -> 1 + 1.3x + 2.1x^2 + 3.1x^3
f(0.5 + ϵ).b - 5.725
```
It is indeed "accurate to (roughly) 16 digits", the best we can hope for using floating point.
We can use this in "algorithms" as well as simple polynomials.
Consider the polynomial $1 + x + x^2 + \cdots + x^n$ implemented with a for-loop:
```julia
function s(n, x)
    ret = 1 + x # first two terms
    for k = 2:n
        ret += x^k
    end
    ret
end
s(10, 0.1 + ϵ).b
```
This matches exactly the "true" (up to rounding) derivative:
```julia
sum((1:10) .* 0.1 .^ (0:9))
```
Finally, we can try the more complicated example:
```julia
f = x -> exp(x^2 + exp(x))
f(1 + ϵ)
```
What makes dual numbers so effective is that, unlike finite differences, they are not prone to disastrous growth due to round-off errors.
YouTube Lectures: Structured Matrices Permutation Matrices Orthogonal Matrices
We have seen how algebraic operations (+, -, *, /) are well-defined for floating-point numbers. Now we see how this allows us to do (approximate) linear algebra operations on structured matrices. That is, we consider dense, triangular, banded, permutation, and orthogonal matrices.
```julia
using LinearAlgebra, Plots, BenchmarkTools
```
A Vector of a primitive type (like Int or Float64) is stored consecutively in memory. E.g. if we have a Vector{Int8} of length n then it is stored as 8n bits (n bytes) in a row.
A Matrix is stored consecutively in memory, going down column-by-column. That is,
```julia
A = [1 2; 3 4; 5 6]
```
is actually stored equivalently to a length 6 vector:
```julia
vec(A)
```
This is known as column-major format.

Remark: Note that transposing A is done lazily, and so A' stores the entries by row. That is, A' is stored in row-major format.
Matrix-vector multiplication works as expected:
```julia
x = [7, 8]
A*x
```
Note there are two ways this can be implemented: either the traditional definition, going row-by-row:
$$c_k = \sum_{j=1}^n a_{kj} x_j,$$
or going column-by-column, viewing the result as a linear combination of the columns $\mathbf{a}_j$ of $A$:
$$\mathbf{c} = \sum_{j=1}^n x_j \mathbf{a}_j.$$
It is easy to implement either version of matrix multiplication in terms of the algebraic operations we have learned, in this case just using integer arithmetic:
```julia
## go row-by-row
function mul_rows(A, x)
    m,n = size(A)
    # promote_type finds a type compatible with both types; eltype gives the element type of a vector/matrix
    c = zeros(promote_type(eltype(x), eltype(A)), m)
    for k = 1:m, j = 1:n
        c[k] += A[k, j] * x[j]
    end
    c
end

## go column-by-column
function mul(A, x)
    m,n = size(A)
    c = zeros(promote_type(eltype(x), eltype(A)), m)
    for j = 1:n, k = 1:m
        c[k] += A[k, j] * x[j]
    end
    c
end

mul_rows(A, x), mul(A, x)
```
Either implementation requires $O(mn)$ operations. However, mul accesses the entries of A going down the columns, which happens to be significantly faster than mul_rows, due to accessing the memory of A in order. We can see this by measuring the time it takes using @btime:
```julia
n = 1000
A = randn(n,n) # create n × n matrix with random normal entries
x = randn(n)   # create length n vector with random normal entries

@btime mul_rows(A,x);
@btime mul(A,x);
@btime A*x; # built-in, high performance implementation. USE THIS in practice
```
Here ms means milliseconds ($0.001 = 10^{-3}$ seconds) and μs means microseconds ($0.000001 = 10^{-6}$ seconds). So we observe that mul is roughly 3x faster than mul_rows, while the optimised * is roughly 5x faster than mul.
Remark: (advanced) For floating point types, A*x is implemented in BLAS which is generally multi-threaded
and is not identical to mul(A,x), that is, some inputs will differ in how the computations
are rounded.
Note that the rules of arithmetic apply here: matrix multiplication with floats will incur round-off error (the precise details of which are subject to the implementation):
```julia
A = [1.4 0.4; 2.0 1/2]
A * [1, -1] # first entry has round-off error, but 2nd entry is exact
```
And integer arithmetic will be prone to overflow:
```julia
A = fill(Int8(2^6), 2, 2) # make a matrix whose entries are all equal to 2^6
A * Int8[1,1] # we have overflowed and get a negative number -2^7
```
Solving a linear system is done using \:
```julia
A = [1 2 3; 1 2 4; 3 7 8]
b = [10; 11; 12]
A \ b
```
Despite the answer being integer-valued, here we see that it resorted to using floating-point arithmetic, incurring rounding error. But it is "accurate to (roughly) 16 digits".
As we shall see, the way solving a linear system works is we first write A as a
product of simpler matrices, e.g., a product of triangular matrices.
Remark: (advanced) For floating point types, A \ x is implemented in LAPACK, which
like BLAS is generally multi-threaded and in fact different machines may round differently.
Triangular matrices are represented by dense square matrices where the entries below the diagonal are ignored:
```julia
A = [1 2 3; 4 5 6; 7 8 9]
U = UpperTriangular(A)
```
We can see that U is storing all the entries of A:
```julia
U.data
```
Similarly we can create a lower triangular matrix by ignoring the entries above the diagonal:
```julia
L = LowerTriangular(A)
```
xxxxxxxxxxL = LowerTriangular(A)If we know a matrix is triangular we can do matrix-vector multiplication in roughly half
the number of operations.
Moreover, we can easily invert matrices.
Consider a simple 3×3 example, which can be solved with \:
```julia
b = [5, 6, 7]
x = U \ b
```
Behind the scenes, \ is doing back-substitution: considering the last row, we have all zeros apart from the last column, so we know that x[3] must be equal to:
```julia
b[3] / U[3,3]
```
Once we know x[3], the second row states U[2,2]*x[2] + U[2,3]*x[3] == b[2]; rearranging, we get that x[2] must be:
```julia
(b[2] - U[2,3]*x[3]) / U[2,2]
```
Finally, the first row states U[1,1]*x[1] + U[1,2]*x[2] + U[1,3]*x[3] == b[1], i.e. x[1] is equal to:
```julia
(b[1] - U[1,2]*x[2] - U[1,3]*x[3]) / U[1,1]
```
More generally, we can solve the upper-triangular system $U\mathbf{x} = \mathbf{b}$ by computing $x_n, x_{n-1}, \ldots, x_1$ via:
$$x_k = {b_k - \sum_{j=k+1}^n u_{kj} x_j \over u_{kk}}.$$
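A minimal sketch of back-substitution following this formula (the problem sheet develops this properly; here we assume U is a square UpperTriangular with nonzero diagonal):

```julia
using LinearAlgebra

# Solve Ux = b by back-substitution, working from the last row upwards.
function backsubstitution(U::UpperTriangular, b)
    n = length(b)
    x = zeros(n)
    for k = n:-1:1
        r = b[k]
        for j = k+1:n
            r -= U[k,j] * x[j]  # subtract the already-computed entries
        end
        x[k] = r / U[k,k]
    end
    x
end

U = UpperTriangular([1.0 2 3; 0 5 6; 0 0 9])
b = [5.0, 6, 7]
backsubstitution(U, b) ≈ U \ b  # matches the built-in solver
```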
The problem sheet will explore implementing this method, as well as forward substitution for inverting lower triangular matrices. The cost of multiplying and solving linear systems with a triangular matrix is $O(n^2)$ operations.
A banded matrix is zero off a prescribed number of diagonals. We call the number of (potentially) non-zero diagonals the bandwidths:
Definition (bandwidths) A matrix $A$ has lower-bandwidth $l$ if $a_{kj} = 0$ for all $k - j > l$, and upper-bandwidth $u$ if $a_{kj} = 0$ for all $j - k > u$. We say that it has strictly lower-bandwidth $l$ if it has lower-bandwidth $l$ and there exists a $j$ such that $a_{j+l,j} \neq 0$. We say that it has strictly upper-bandwidth $u$ if it has upper-bandwidth $u$ and there exists a $k$ such that $a_{k,k+u} \neq 0$.

Definition (Diagonal) Diagonal matrices are square matrices with bandwidths $l = u = 0$.
Diagonal matrices in Julia are stored as a vector containing the diagonal entries:
```julia
x = [1,2,3]
D = Diagonal(x)
```
It is clear that we can perform diagonal-vector multiplications and solve linear systems involving diagonal matrices efficiently (in $O(n)$ operations).
Definition (Bidiagonal) If a square matrix has bandwidths $(l, u) = (1, 0)$ it is lower-bidiagonal, and if it has bandwidths $(l, u) = (0, 1)$ it is upper-bidiagonal.

We can create Bidiagonal matrices in Julia by specifying the diagonal and off-diagonal:
```julia
Bidiagonal([1,2,3], [4,5], :L)
```
```julia
Bidiagonal([1,2,3], [4,5], :U)
```
Multiplication and solving linear systems with Bidiagonal systems is also $O(n)$ operations, using forward or back substitution.
Definition (Tridiagonal) If a square matrix has bandwidths $l = u = 1$ it is tridiagonal.

Julia has a type Tridiagonal for representing a tridiagonal matrix from its sub-diagonal, diagonal, and super-diagonal:
```julia
Tridiagonal([1,2], [3,4,5], [6,7])
```
Tridiagonal matrices will come up in second-order differential equations and orthogonal polynomials. We will later see how linear systems involving tridiagonal matrices can be solved in $O(n)$ operations.
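As a preview, here is one possible $O(n)$ approach: forward elimination followed by back-substitution (a sketch without pivoting, assuming the elimination never divides by zero; the module's actual method may differ):

```julia
using LinearAlgebra

# Solve a tridiagonal system T x = b in O(n) operations, given the
# sub-diagonal dl, diagonal d, and super-diagonal du of T.
function tridiagsolve(dl, d, du, b)
    n = length(d)
    d = copy(d); b = copy(b)
    for k = 2:n                   # forward elimination: zero out the sub-diagonal
        w = dl[k-1] / d[k-1]
        d[k] -= w * du[k-1]
        b[k] -= w * b[k-1]
    end
    x = zeros(n)                  # back-substitution on the resulting bidiagonal system
    x[n] = b[n] / d[n]
    for k = n-1:-1:1
        x[k] = (b[k] - du[k] * x[k+1]) / d[k]
    end
    x
end

T = Tridiagonal([1.0, 2], [3.0, 4, 5], [6.0, 7])
b = [1.0, 2, 3]
tridiagsolve([1.0, 2], [3.0, 4, 5], [6.0, 7], b) ≈ T \ b
```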
Permutation matrices are matrices that represent the action of permuting the entries of a vector, that is, matrix representations of the symmetric group $S_n$, acting on $\mathbb{R}^n$. Recall that a permutation $σ$ is a bijection from $\{1, 2, \ldots, n\}$ to itself, where $σ_k$ denotes the image of $k$. We can encode a permutation in vector notation as $[σ_1, σ_2, \ldots, σ_n]$.
Example (permutation of a vector)
Consider the permutation $σ$ given in vector notation by $[1, 4, 2, 5, 3]$. We can apply it to a vector:
```julia
σ = [1, 4, 2, 5, 3]
v = [6, 7, 8, 9, 10]
v[σ] # we permute entries of v
```
Its inverse permutation $σ^{-1}$ satisfies $σ^{-1}(σ(k)) = k$.
Julia has the function invperm for computing the vector that encodes
the inverse permutation:
```julia
σ⁻¹ = invperm(σ) # note that ⁻¹ are just unicode characters in the variable name
```
And indeed, permuting the entries by σ and then by σ⁻¹ returns us to our original vector:
```julia
v[σ][σ⁻¹] # permuting by σ and then σ⁻¹ gets us back
```
Note that the operator $\mathbf{v} \mapsto \mathbf{v}[\sigma]$
is linear in $\mathbf{v}$, hence it corresponds to multiplication by a matrix $P_\sigma$.
The entries of this matrix are
$$P_\sigma[k,j] = \delta_{\sigma_k, j},$$
where $\delta_{k,j}$ is the Kronecker delta, equal to $1$ if $k = j$ and $0$ otherwise.
This construction motivates the following definition:
Definition (permutation matrix)
is a permutation matrix if it is equal to the identity matrix with its rows permuted.
Example (5×5 permutation matrix)
We can construct the permutation representation for $\sigma$ as follows:
```julia
P = I(5)[σ, :]
```
And indeed, we see its action is as expected:
```julia
P * v
```
Remark: (advanced) Note that P is a special type SparseMatrixCSC. This is used
to represent a matrix by storing only the non-zero entries as well as their location.
This is an important data type in high-performance scientific computing, but we will not
be using general sparse matrices in this module.
Proposition (permutation matrix inverse) Let $P_\sigma$
be a permutation matrix corresponding to the permutation $\sigma$. Then
$$P_{\sigma^{-1}} = P_\sigma^\top = P_\sigma^{-1}.$$
That is, $P_\sigma$
is orthogonal:
$$P_\sigma^\top P_\sigma = P_\sigma P_\sigma^\top = I.$$
Proof
We prove orthogonality via:
$$\mathbf{e}_k^\top P_\sigma^\top P_\sigma \mathbf{e}_j = (P_\sigma\mathbf{e}_k)^\top (P_\sigma\mathbf{e}_j) = \mathbf{e}_{\sigma^{-1}_k}^\top \mathbf{e}_{\sigma^{-1}_j} = \delta_{k,j}.$$
This shows $P_\sigma^\top P_\sigma = I$ and hence $P_\sigma^{-1} = P_\sigma^\top$.
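The proposition is easy to confirm numerically, using the σ from the example above (redefined here so the snippet is self-contained):

```julia
using LinearAlgebra

σ = [1, 4, 2, 5, 3]
P = Matrix{Float64}(I(5)[σ, :])  # permutation matrix for σ
P' * P                           # the identity, confirming orthogonality
```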
Permutation matrices are examples of sparse matrices that can be very easily inverted.
Definition (orthogonal matrix) A square matrix $Q$ is orthogonal if its inverse is its transpose: $Q^\top Q = QQ^\top = I$.
We have already seen examples of orthogonal matrices (permutation matrices). Here we discuss two important special cases: simple rotations and reflections.
Definition (simple rotation) A 2×2 rotation matrix through angle $\theta$
is
$$Q_\theta := \begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix}.$$
In what follows we use the following for writing the angle of a vector:
Definition (two-arg arctan) The two-argument arctan function gives the angle
$\theta$ through the point $[a,b]^\top$, i.e.,
$$\sqrt{a^2+b^2}\begin{bmatrix}\cos\theta \\ \sin\theta\end{bmatrix} = \begin{bmatrix} a \\ b \end{bmatrix}.$$
It can be defined in terms of the standard arctan as follows:
$${\rm atan}(b, a) := \begin{cases} {\rm atan}(b/a) & a > 0, \\ {\rm atan}(b/a) + \pi & a < 0,\ b \geq 0, \\ {\rm atan}(b/a) - \pi & a < 0,\ b < 0, \\ \pi/2 & a = 0,\ b > 0, \\ -\pi/2 & a = 0,\ b < 0. \end{cases}$$
This is available in Julia:
```julia
atan(-1, -2) # angle through [-2,-1]
```
We can rotate an arbitrary vector so that it is aligned with the first coordinate axis. Interestingly it only requires basic algebraic functions (no trigonometric functions):
Proposition (rotation of a vector) The matrix
$$Q = \frac{1}{\sqrt{a^2+b^2}}\begin{bmatrix} a & b \\ -b & a \end{bmatrix}$$
is a rotation matrix satisfying
$$Q\begin{bmatrix} a \\ b \end{bmatrix} = \sqrt{a^2+b^2}\begin{bmatrix} 1 \\ 0 \end{bmatrix}.$$
Proof
The last equation is trivial so the only question is that it is a rotation matrix.
Define $\theta := -{\rm atan}(b, a)$, so that $\cos\theta = a/\sqrt{a^2+b^2}$ and $\sin\theta = -b/\sqrt{a^2+b^2}$. Then $Q = Q_\theta$ is a rotation matrix.
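We can check the proposition with concrete numbers ($a$ and $b$ chosen arbitrarily):

```julia
using LinearAlgebra

a, b = 3.0, 4.0
r = sqrt(a^2 + b^2)
Q = [a b; -b a] / r   # rotation matrix from the proposition
Q * [a, b]            # equals [r, 0]
```

The determinant of `Q` is $1$, distinguishing it from a reflection (determinant $-1$).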
In addition to rotations, another type of orthogonal matrix are reflections:
Definition (reflection matrix) Given a vector $\mathbf{v}$
satisfying $\|\mathbf{v}\| = 1$, the reflection matrix is the orthogonal matrix
$$Q_{\mathbf{v}} := I - 2\mathbf{v}\mathbf{v}^\top.$$
These are reflections in the direction of $\mathbf{v}$.
Proposition
$Q_{\mathbf{v}}$ satisfies:
- Symmetry: $Q_{\mathbf{v}} = Q_{\mathbf{v}}^\top$
- Orthogonality: $Q_{\mathbf{v}}^\top Q_{\mathbf{v}} = I$
- $\mathbf{v}$ is an eigenvector of $Q_{\mathbf{v}}$ with eigenvalue $-1$
- $Q_{\mathbf{v}}$ is a rank-1 perturbation of $I$
Proof
Property 1 follows immediately. Property 2 follows from
$$Q_{\mathbf{v}}^\top Q_{\mathbf{v}} = Q_{\mathbf{v}}^2 = I - 4\mathbf{v}\mathbf{v}^\top + 4\mathbf{v}(\mathbf{v}^\top\mathbf{v})\mathbf{v}^\top = I.$$
Property 3 follows since
$$Q_{\mathbf{v}}\mathbf{v} = \mathbf{v} - 2\mathbf{v}(\mathbf{v}^\top\mathbf{v}) = -\mathbf{v}.$$
Property 4 follows since $\mathbf{v}\mathbf{v}^\top$ has rank 1.
In other words, $Q_{\mathbf{v}} - I = -2\mathbf{v}\mathbf{v}^\top$ is a rank-1 matrix.
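The properties above are easy to verify numerically for a random unit vector:

```julia
using LinearAlgebra

v = normalize(randn(4))  # random unit vector
Q = I - 2v*v'            # reflection matrix
```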
Example (reflection through 2-vector) Consider reflection through the unit vector $\mathbf{v} = [\cos\theta, \sin\theta]^\top$.
Note this indeed has unit norm: $\cos^2\theta + \sin^2\theta = 1$.
Thus the reflection matrix is:
$$Q_{\mathbf{v}} = I - 2\mathbf{v}\mathbf{v}^\top = \begin{bmatrix} 1 - 2\cos^2\theta & -2\cos\theta\sin\theta \\ -2\cos\theta\sin\theta & 1 - 2\sin^2\theta \end{bmatrix}.$$
Indeed it is symmetric, and orthogonal. It sends $\mathbf{v}$ to $-\mathbf{v}$.
Any vector orthogonal to $\mathbf{v}$ is left fixed.
Note that building the matrix $Q_{\mathbf{v}}$ explicitly is wasteful: we can apply it to a vector in $O(n)$ operations via $Q_{\mathbf{v}}\mathbf{x} = \mathbf{x} - 2\mathbf{v}(\mathbf{v}^\top\mathbf{x})$.
Just as rotations can be used to rotate vectors to be aligned with coordinate axis, so can reflections,
but in this case it works for vectors in
Definition (Householder reflection) For a given vector $\mathbf{x}$, define the Householder reflection
$$Q_{\mathbf{x}}^{\pm,\rm H} := Q_{\mathbf{w}}$$
for $\mathbf{y} = \mp\|\mathbf{x}\|\mathbf{e}_1 + \mathbf{x}$
and $\mathbf{w} = \mathbf{y}/\|\mathbf{y}\|$. The default choice in sign is:
$$Q_{\mathbf{x}}^{\rm H} := Q_{\mathbf{x}}^{-{\rm sign}(x_1),\rm H}.$$
Lemma (Householder reflection)
$$Q_{\mathbf{x}}^{\pm,\rm H}\mathbf{x} = \pm\|\mathbf{x}\|\mathbf{e}_1.$$
Proof Note that
$$\|\mathbf{y}\|^2 = 2\|\mathbf{x}\|^2 \mp 2\|\mathbf{x}\|x_1 = 2\mathbf{y}^\top\mathbf{x},$$
where $x_1 = \mathbf{e}_1^\top\mathbf{x}$. Thus
$$Q_{\mathbf{x}}^{\pm,\rm H}\mathbf{x} = \mathbf{x} - 2\mathbf{w}(\mathbf{w}^\top\mathbf{x}) = \mathbf{x} - \mathbf{y} = \pm\|\mathbf{x}\|\mathbf{e}_1.$$
Why do we choose the opposite sign of $x_1$ as the default? Consider what happens with the other sign when $\mathbf{x}$ is close to $\mathbf{e}_1$:
```julia
h = 10.0^(-10)
x = [1, h]
y = -norm(x)*[1, 0] + x
w = y/norm(y)
Q = I - 2w*w'
Q*x
```
It didn't work! Even worse is if h = 0:
```julia
h = 0
x = [1, h]
y = -norm(x)*[1, 0] + x
w = y/norm(y)
Q = I - 2w*w'
Q*x
```
This is because y has large relative error due to cancellation
from floating point errors in computing the first entry x[1] - norm(x).
(Or has norm zero if h = 0.)
We avoid this cancellation by using the default choice:
```julia
h = 10.0^(-10)
x = [1, h]
y = sign(x[1])*norm(x)*[1, 0] + x
w = y/norm(y)
Q = I - 2w*w'
Q*x
```
YouTube Lectures: Least squares and QR Gram-Schmidt and Reduced QR Householder and QR PLU Decomposition Cholesky Decomposition
We now look at several decompositions (or factorisations)
of a matrix into products of structured matrices, and their use in solving least squares problems and linear systems.
For a square or rectangular matrix $A \in \mathbb{R}^{m\times n}$ with $m \geq n$, we consider the QR decomposition
$$A = QR$$
where $Q \in \mathbb{R}^{m\times m}$ is orthogonal and $R \in \mathbb{R}^{m\times n}$ is right (upper) triangular, and the reduced QR decomposition
$$A = \hat{Q}\hat{R}$$
where $\hat{Q} \in \mathbb{R}^{m\times n}$ has orthonormal columns and $\hat{R} \in \mathbb{R}^{n\times n}$ is upper triangular.
For a square matrix we consider the PLU decomposition:
$$A = P^\top LU$$
where $P$ is a permutation matrix, $L$ is lower triangular with ones on the diagonal and $U$ is upper triangular.
Finally, for a square, symmetric positive definite matrix ($\mathbf{x}^\top A\mathbf{x} > 0$ for all nonzero $\mathbf{x}$) we consider the Cholesky decomposition:
$$A = LL^\top.$$
The importance of these decompositions for square matrices is that their component pieces are easy to invert on a computer (e.g. $A = P^\top LU$ gives $A^{-1}\mathbf{b} = U^{-1}L^{-1}P\mathbf{b}$),
and we saw last lecture that triangular and orthogonal matrices are easy to invert when applied to a vector.
In this lecture we discuss the following:
```julia
using LinearAlgebra, Plots, BenchmarkTools
```
Here we consider rectangular matrices with more rows than columns.
A QR decomposition decomposes a matrix into an orthogonal matrix $Q$ times an upper triangular matrix $R$.
We can use it to solve a least squares problem using the norm-preserving property (see PS3) of orthogonal matrices:
Now note that the rows
This norm is minimised when the first term vanishes. Provided the column rank of $A$ is maximal (so $\hat{R}$ is invertible), the least squares solution is $\mathbf{x} = \hat{R}^{-1}\hat{Q}^\top\mathbf{b}$.
Example (quadratic fit) Suppose we want to fit noisy data by a quadratic
That is, we want to choose
where
We can solve this using the QR decomposition:
```julia
m, n = 100, 3
x = range(0, 1; length=m) # 100 points
f = 2 .+ x .+ 2x.^2 .+ 0.1 .* randn.() # noisy quadratic
A = x .^ (0:2)' # 100 x 3 matrix, equivalent to [ones(m) x x.^2]
Q, R̂ = qr(A)
Q̂ = Q[:, 1:n] # Q represents the full orthogonal matrix so we take the first 3 columns
p₀, p₁, p₂ = R̂ \ Q̂'f
```
We can visualise the fit:
```julia
p = x -> p₀ + p₁*x + p₂*x^2
scatter(x, f; label="samples", legend=:bottomright)
plot!(x, p.(x); label="quadratic")
```
Note that \ with a rectangular system does least squares by default:
```julia
A \ f
```
How do we compute the QR decomposition? We begin with a method you may have seen before in another guise. Write
where
In other words: the columns of R̂ is triangular we have
for all
while if
It is possible to find an orthogonal basis using the Gram–Schmidt algorithm. We construct it via induction: assume that
where
for
so that for
Then we define
which satisfies
We now reinterpret this construction as a reduced QR decomposition.
Define
Thus
That is, we are computing the reduced QR decomposition column-by-column.
Running this algorithm to
We are going to compute the reduced QR of a random matrix
```julia
m, n = 5, 4
A = randn(m, n)
Q, R̂ = qr(A)
Q̂ = Q[:, 1:n]
```
The first column of Q̂ is indeed a normalised first column of A:
```julia
R = zeros(n, n)
Q = zeros(m, n)
R[1,1] = norm(A[:,1])
Q[:,1] = A[:,1]/R[1,1]
```
We now determine the next entries as
```julia
R[1,2] = Q[:,1]'A[:,2]
v = A[:,2] - Q[:,1]*R[1,2]
R[2,2] = norm(v)
Q[:,2] = v/R[2,2]
```
And the third column is then:
```julia
R[1,3] = Q[:,1]'A[:,3]
R[2,3] = Q[:,2]'A[:,3]
v = A[:,3] - Q[:,1:2]*R[1:2,3]
R[3,3] = norm(v)
Q[:,3] = v/R[3,3]
```
(Note the signs may not necessarily match.)
We can clean this up as a simple algorithm:
```julia
function gramschmidt(A)
    m, n = size(A)
    m ≥ n || error("Not supported")
    R = zeros(n, n)
    Q = zeros(m, n)
    for j = 1:n
        for k = 1:j-1
            R[k,j] = Q[:,k]'*A[:,j]
        end
        v = A[:,j] - Q[:,1:j-1]*R[1:j-1,j]
        R[j,j] = norm(v)
        Q[:,j] = v/R[j,j]
    end
    Q, R
end

Q, R = gramschmidt(A)
norm(A - Q*R)
```
We see within the for j = 1:n loop that we have $O(mj)$ operations, giving a total cost of $O(mn^2)$ operations.
Unfortunately, the Gram–Schmidt algorithm is unstable: the rounding errors when implemented in floating point accumulate in a way that we lose orthogonality:
```julia
A = randn(300, 300)
Q, R = gramschmidt(A)
norm(Q'Q - I)
```
As an alternative, we will consider using Householder reflections to introduce zeros below the diagonal. Thus, if Gram–Schmidt is a process of triangular orthogonalisation (using triangular matrices to orthogonalise), Householder reflections is a process of orthogonal triangularisation (using orthogonal matrices to triangularise).
Consider multiplication by the Householder reflection corresponding to the first column, that is, for
consider
where
noting
But now consider
so that
where
Thus the first two columns are triangular.
The inductive construction is thus clear. If we define
Then
i.e.
The implementation is cleaner. We do a naive implementation here:
```julia
function householderreflection(x)
    y = copy(x)
    # we cannot use sign(x[1]) in case x[1] == 0
    y[1] += (x[1] ≥ 0 ? 1 : -1)*norm(x)
    w = y/norm(y)
    I - 2*w*w'
end

function householderqr(A)
    m, n = size(A)
    R = copy(A)
    Q = Matrix(1.0I, m, m)
    for j = 1:n
        Qⱼ = householderreflection(R[j:end,j])
        R[j:end,:] = Qⱼ*R[j:end,:]
        Q[:,j:end] = Q[:,j:end]*Qⱼ
    end
    Q, R
end

m, n = 7, 5
A = randn(m, n)
Q, R = householderqr(A)
Q*R ≈ A
```
Note because we are forming a full matrix representation of each Householder
reflection this is a slow algorithm, taking $O(m^2n^2)$ operations; an optimised implementation that never forms the reflections as matrices requires only $O(mn^2)$ operations.
Example We will now do an example by hand. Consider the
For the first column we have
so that
In this example the next column is already upper-triangular, but because of our choice of reflection we will end up swapping the sign, that is
so that
The final reflection is
giving us
That is,
Just as Gram–Schmidt can be reinterpreted as a reduced QR decomposition, Gaussian elimination with pivoting can be interpreted as a PLU decomposition.
Consider the following set of
where
These satisfy the following special properties:
Proposition (one-column lower triangular inverse) If
$$L_j = I - \mathbf{l}_j\mathbf{e}_j^\top,$$
where the first $j$ entries of $\mathbf{l}_j$ are zero, then
$$L_j^{-1} = I + \mathbf{l}_j\mathbf{e}_j^\top.$$
Proposition (one-column lower triangular multiplication) If $L_j = I - \mathbf{l}_j\mathbf{e}_j^\top$
and $L_k = I - \mathbf{l}_k\mathbf{e}_k^\top$ for $j \leq k$, then
$$L_jL_k = I - \mathbf{l}_j\mathbf{e}_j^\top - \mathbf{l}_k\mathbf{e}_k^\top.$$
Lemma (one-column lower triangular with pivoting) If
is a permutation that leaves the first rows fixed (that is, for ) and then where
.
Proof Write
where
Then we have
noting that
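We can sanity-check the inverse proposition numerically (with made-up entries below the diagonal):

```julia
using LinearAlgebra

n = 4
𝐥 = [0.0; 2.0; 3.0; 4.0]           # first entry zero (here j = 1)
L₁ = I - 𝐥 * [1 zeros(1, n-1)]     # L₁ = I - 𝐥*𝐞₁'
L₁inv = I + 𝐥 * [1 zeros(1, n-1)]  # claimed inverse: just flip the sign
```

The cross term vanishes because $\mathbf{e}_1^\top\mathbf{l}_1 = 0$, which is why flipping the sign suffices.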
Before discussing pivoting, consider standard Gaussian elimination where one row-reduces to introduce zeros column-by-column. We will mimic the computation of the QR decomposition to view this as a triangular triangularisation.
Consider the matrix
We then have
where
Then
where
Thus the first two columns are triangular.
The inductive construction is again clear. If we define
Then
i.e.
Writing
and using the properties of inversion and multiplication we therefore deduce
Example (computer) We will do a numerical example (by-hand is equivalent to Gaussian elimination). The first lower triangular matrix is:
```julia
n = 4
A = randn(n, n)
L₁ = I - [0; A[2:end,1]/A[1,1]] * [1 zeros(1, n-1)]
```
Which indeed introduces zeros in the first column:
```julia
A₁ = L₁*A
```
Now we make the next lower triangular operator:
```julia
L₂ = I - [0; 0; A₁[3:end,2]/A₁[2,2]] * [0 1 zeros(1, n-2)]
```
So that
```julia
A₂ = L₂*L₁*A
```
The last one is:
```julia
L₃ = I - [0; 0; 0; A₂[4:end,3]/A₂[3,3]] * [0 0 1 zeros(1, n-3)]
```
Giving us
```julia
U = L₃*L₂*L₁*A
```
and
```julia
L = inv(L₁)*inv(L₂)*inv(L₃)
```
Note how the entries of L are indeed identical to the negative
lower entries of L₁, L₂ and L₃.
Example (by-hand)
Consider the matrix
We have
We then deduce
We learned in first year linear algebra that if a diagonal entry is zero when doing Gaussian elimination one has to row pivot. For stability, in implementation one always pivots: swap the largest in magnitude entry for the entry on the diagonal. In terms of a decomposition, this leads to
where
Thus we can deduce that
where the tilde denotes the combined actions of swapping the permutation and lower-triangular matrices, that is,
where
Writing
and using the properties of inversion and multiplication we therefore deduce
Example
Again we consider the matrix
Even though
We now pivot and upper triangularise the second column:
We now move
That is
We see how this example is done on a computer:
```julia
A = [1 1 1; 2 4 8; 1 4 9]
L, U, σ = lu(A) # σ is a vector encoding the permutation
```
The permutation is encoded in the vector σ, so we can solve a linear system via forward- and back-substitution:
```julia
b = randn(3)
U\(L\b[σ]) == A\b
```
Note the entries match exactly because this is precisely what \ is using.
Cholesky Decomposition is a form of Gaussian elimination (without pivoting) that exploits symmetry in the problem, resulting in a substantial speedup. It is only relevant for symmetric positive definite matrices.
Definition (positive definite) A square matrix $A \in \mathbb{R}^{n\times n}$
is positive definite if for all $\mathbf{x} \in \mathbb{R}^n$, $\mathbf{x} \neq 0$, we have
$$\mathbf{x}^\top A\mathbf{x} > 0.$$
First we establish some basic properties of positive definite matrices:
Proposition (conj. pos. def.) If $A \in \mathbb{R}^{n\times n}$
is positive definite and $V \in \mathbb{R}^{n\times n}$ is non-singular then $V^\top AV$ is positive definite.
Proposition (diag positivity) If $A \in \mathbb{R}^{n\times n}$
is positive definite then its diagonal entries are positive: $a_{kk} > 0$.
Theorem (subslice pos. def.) If $A \in \mathbb{R}^{n\times n}$
is positive definite and $\mathbf{k} = [k_1,\dots,k_m]$ is a vector of integers where any integer appears only once, then $A[\mathbf{k},\mathbf{k}]$ is also positive definite.
We leave the proofs to the problem sheets. Here is the key result:
Theorem (Cholesky and sym. pos. def.) A matrix $A$
is symmetric positive definite if and only if it has a Cholesky decomposition
$$A = LL^\top$$
where the diagonals of $L$ are positive.
Proof If
where we use the fact that
For the other direction we will prove it by induction, with the
Note that
and hence
satisfies
Note hidden in this proof is a simple algorithm for computing the Cholesky decomposition. We define
Then
This algorithm succeeds if and only if
Example Consider the matrix
Then
Continuing, we have
Finally
Thus we get
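Hidden in the construction above is a column-by-column algorithm. Here is a minimal sketch of that algorithm (`mycholesky` is a hypothetical name; it is applied to a small generic positive definite matrix, not the example above):

```julia
using LinearAlgebra

function mycholesky(A)
    n = size(A, 1)
    L = zeros(n, n)
    Aⱼ = copy(Matrix(A))        # working block, shrinks by one each step
    for j = 1:n
        α, 𝐯 = Aⱼ[1,1], Aⱼ[2:end,1]
        L[j,j] = sqrt(α)        # fails if α ≤ 0, i.e. A not positive definite
        L[j+1:end,j] = 𝐯/sqrt(α)
        Aⱼ = Aⱼ[2:end,2:end] - 𝐯*𝐯'/α   # Schur complement
    end
    LowerTriangular(L)
end

A = [4.0 2.0; 2.0 3.0]
L = mycholesky(A)
L * L'   # recovers A
```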
The different decompositions have trade-offs between speed and stability. First we compare the speed of the different decompositions on a symmetric positive definite matrix, from fastest to slowest:
```julia
n = 100
A = Symmetric(rand(n, n)) + 100I # shift by 100 ensures positivity
cholesky(A); lu(A); qr(A);
```
On my machine, cholesky is ~1.5x faster than lu,
which is ~2x faster than QR.
In terms of stability, QR computed with Householder reflections (and Cholesky for positive definite matrices) are stable, whereas LU is usually unstable (unless the matrix is diagonally dominant). PLU is a very complicated story: in theory it is unstable, but the set of matrices for which it is unstable is extremely small, so small one does not normally run into them.
Here is an example matrix that is in this set.
```julia
function badmatrix(n)
    A = Matrix(1I, n, n)
    A[:,end] .= 1
    for j = 1:n-1
        A[j+1:end,j] .= -1
    end
    A
end

A = badmatrix(5)
```
Note that pivoting will not occur (we do not pivot as the entries below the diagonal are the same magnitude as the diagonal), thus the PLU Decomposition is equivalent to an LU decomposition:
```julia
L, U = lu(A)
```
But here we see an issue: the last column of U is growing exponentially fast! Thus when n is large
we get very large errors:
```julia
n = 100
b = randn(n)
A = badmatrix(n)
norm(A\b - qr(A)\b) # A \ b still uses lu
```
Note qr is completely fine:
```julia
norm(qr(A)\b - qr(big.(A))\b) # roughly machine precision
```
Amazingly, PLU is fine if applied to a small perturbation of A:
```julia
ε = 0.000001
Aε = A .+ ε .* randn.()
norm(Aε \ b - qr(Aε) \ b) # Now it matches!
```
The big open problem in numerical linear algebra is to prove that the set of matrices for which PLU fails has extremely small measure.
YouTube Lectures: Matrix Norms Singular Value Decomposition
In this lecture we discuss matrix and vector norms. The matrix norms we consider are those induced by vector norms.
```julia
using LinearAlgebra, Plots
```
Recall the definition of a (vector-)norm:
Definition (vector-norm) A norm $\|\cdot\|$
on $\mathbb{R}^n$ is a function that satisfies the following, for $\mathbf{x}, \mathbf{y} \in \mathbb{R}^n$ and $c \in \mathbb{R}$:
- Triangle inequality: $\|\mathbf{x} + \mathbf{y}\| \leq \|\mathbf{x}\| + \|\mathbf{y}\|$
- Homogeneity: $\|c\mathbf{x}\| = |c|\,\|\mathbf{x}\|$
- Positive-definiteness: $\|\mathbf{x}\| = 0$ implies that $\mathbf{x} = 0$.
Consider the following example:
Definition (p-norm) For $1 \leq p < \infty$
and $\mathbf{x} \in \mathbb{R}^n$, define the $p$-norm:
$$\|\mathbf{x}\|_p := \left(\sum_{k=1}^n |x_k|^p\right)^{1/p},$$
where $x_k$ is the $k$-th entry of $\mathbf{x}$. For $p = \infty$ we define
$$\|\mathbf{x}\|_\infty := \max_k |x_k|.$$
Theorem (p-norm) $\|\cdot\|_p$
is a norm for $1 \leq p \leq \infty$.
Proof
We will only prove the case
Homogeneity and positive-definiteness are straightforward: e.g.,
and if
For
and
For
That is, we have
In Julia, one can use the inbuilt norm function to calculate norms:
```julia
norm([1,-2,3]) == norm([1,-2,3], 2) == sqrt(1^2 + 2^2 + 3^2)
norm([1,-2,3], 1) == 1 + 2 + 3
norm([1,-2,3], Inf) == 3
```
Just like vectors, matrices have norms that measure their "length". The simplest example is the Frobenius norm,
defined for an $m \times n$ matrix $A$ as
$$\|A\|_F := \sqrt{\sum_{k=1}^m\sum_{j=1}^n a_{kj}^2}.$$
This is available as norm in Julia:
```julia
A = randn(5, 3)
norm(A) == norm(vec(A))
```
While this is the simplest norm, it is not the most useful. Instead, we will build a matrix norm from a vector norm:
Definition (matrix-norm) Suppose $A \in \mathbb{R}^{m\times n}$
and consider two norms $\|\cdot\|_X$ on $\mathbb{R}^n$ and $\|\cdot\|_Y$ on $\mathbb{R}^m$. Define the (induced) matrix norm as:
$$\|A\|_{X\to Y} := \sup_{\mathbf{x} \neq 0} \frac{\|A\mathbf{x}\|_Y}{\|\mathbf{x}\|_X}.$$
Also define
$$\|A\|_p := \|A\|_{p\to p}.$$
For the induced 2, 1, and $\infty$-norms we write $\|A\|_2$, $\|A\|_1$ and $\|A\|_\infty$.
Note an equivalent definition of the induced norm:
$$\|A\|_{X\to Y} = \sup_{\|\mathbf{x}\|_X = 1} \|A\mathbf{x}\|_Y.$$
This follows since we can scale $\mathbf{x}$ by its norm.
Lemma (matrix norms are norms) Induced matrix norms are norms, that is for $\|\cdot\| := \|\cdot\|_{X\to Y}$
we have:
- Triangle inequality: $\|A + B\| \leq \|A\| + \|B\|$
- Homogeneity: $\|cA\| = |c|\,\|A\|$
- Positive-definiteness: $\|A\| = 0 \Rightarrow A = 0$
In addition, they satisfy the following additional properties:
$$\|A\mathbf{x}\|_Y \leq \|A\|_{X\to Y}\|\mathbf{x}\|_X \qquad\text{and}\qquad \|AB\|_{X\to Z} \leq \|A\|_{Y\to Z}\|B\|_{X\to Y}.$$
Proof
First we show the triangle inequality:
Homogeneity is also immediate. Positive-definiteness follows from the fact that if
We have some simple examples of induced norms:
Example ($1$-norm) We claim
$$\|A\|_1 = \max_j \|\mathbf{a}_j\|_1,$$
that is, the maximum $1$-norm of the columns $\mathbf{a}_j$ of $A$.
But the bound is also attained since if
In the problem sheet we see that
$$\|A\|_\infty = \max_k \|A[k,:]\|_1,$$
that is, the maximum $1$-norm of the rows.
Matrix norms are available via opnorm:
```julia
m, n = 5, 3
A = randn(m, n)
opnorm(A, 1) == maximum(norm(A[:,j], 1) for j = 1:n)
opnorm(A, Inf) == maximum(norm(A[k,:], 1) for k = 1:m)
opnorm(A) # the 2-norm
```
An example that does not have a simple formula is $\|A\|_2$; to understand it we need the singular value decomposition.
Proposition (diagonal/orthogonal 2-norms) If $D$
is diagonal with entries $d_1,\dots,d_n$ then $\|D\|_2 = \max_k |d_k|$. If $Q$ is orthogonal then $\|Q\|_2 = 1$.
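Both facts are easy to confirm numerically (the entries here are arbitrary):

```julia
using LinearAlgebra

D = Diagonal([3.0, -5.0, 2.0])
opnorm(Matrix(D))                       # maximum absolute diagonal entry, here 5

Q = Matrix{Float64}(I(3))[[2, 3, 1], :] # a permutation (hence orthogonal) matrix
opnorm(Q)                               # 1
```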
To compute the induced $2$-norm we need the singular value decomposition:
Definition (singular value decomposition) For $A \in \mathbb{R}^{m\times n}$
with rank $r > 0$, the reduced singular value decomposition (SVD) is
$$A = U\Sigma V^\top$$
where $U \in \mathbb{R}^{m\times r}$ and $V \in \mathbb{R}^{n\times r}$ have orthonormal columns and $\Sigma \in \mathbb{R}^{r\times r}$ is diagonal whose diagonal entries, which we call singular values, are all positive and decreasing: $\sigma_1 \geq \sigma_2 \geq \cdots \geq \sigma_r > 0$. The full singular value decomposition (SVD) is
$$A = \tilde{U}\tilde{\Sigma}\tilde{V}^\top$$
where $\tilde{U} \in \mathbb{R}^{m\times m}$ and $\tilde{V} \in \mathbb{R}^{n\times n}$ are orthogonal matrices and $\tilde{\Sigma} \in \mathbb{R}^{m\times n}$ has only diagonal entries: $\tilde{\Sigma}[k,k] = \sigma_k$ for $k \leq r$, with all other entries zero.
To show the SVD exists we first establish some properties of a Gram matrix ($A^\top A$):
Proposition (Gram matrix kernel) The kernel of $A$
is also the kernel of $A^\top A$.
Proof
If $A\mathbf{x} = 0$ then clearly $A^\top A\mathbf{x} = 0$. Conversely, if $A^\top A\mathbf{x} = 0$ then
$$0 = \mathbf{x}^\top A^\top A\mathbf{x} = \|A\mathbf{x}\|^2,$$
which means $A\mathbf{x} = 0$.
Proposition (Gram matrix diagonalisation) The Gram matrix satisfies
$$A^\top A = Q\Lambda Q^\top,$$
where $Q$ is orthogonal and the eigenvalues $\lambda_k$ (the diagonal entries of $\Lambda$) are non-negative.
Proof
$A^\top A$ is symmetric so, by the spectral theorem, it is diagonalisable with an orthogonal eigenvector matrix $Q$. The eigenvalues are non-negative since $\lambda_k = \mathbf{q}_k^\top A^\top A\mathbf{q}_k = \|A\mathbf{q}_k\|^2 \geq 0$.
This connection allows us to prove existence:
Theorem (SVD existence) Every
has an SVD.
Proof Consider
Assume (as usual) that the eigenvalues are sorted in decreasing modulus, and so
the corresponding (orthonormal) eigenvectors, with
the corresponding kernel. Define
Now define
which is orthogonal since
Thus we have
where we use the fact that
Singular values tell us the 2-norm:
Corollary (singular values and norm)
and if
is invertible, then
Proof
First we establish the upper-bound:
This is attained using the first right singular vector:
The inverse result follows since the inverse has SVD
is the SVD of
is the permutation that reverses the entries, that is,
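The corollary can be checked against a random matrix (invertible with probability 1):

```julia
using LinearAlgebra

A = randn(5, 5)
σ = svdvals(A)
opnorm(A)   # equals the largest singular value σ[1]
```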
We will not discuss in this module computation of singular value decompositions or eigenvalues: they involve iterative algorithms (actually built on a sequence of QR decompositions).
One of the main usages for SVDs is low-rank approximation:
Theorem (best low rank approximation) The matrix
$$A_k := U_k\Sigma_kV_k^\top,$$
where $U_k$ and $V_k$ contain the first $k$ columns of $U$ and $V$ and $\Sigma_k$ the first $k$ singular values, is the best 2-norm approximation of $A$
by a rank $k$ matrix, that is, for all rank-$k$ matrices $B$, we have
$$\|A - A_k\|_2 \leq \|A - B\|_2.$$
Proof We have
Suppose a rank-
For all
But for all
i.e.,
The dimension of the span of
Here we show an example of a simple low-rank approximation using the SVD. Consider the Hilbert matrix:
```julia
function hilbertmatrix(n)
    ret = zeros(n, n)
    for j = 1:n, k = 1:n
        ret[k,j] = 1/(k + j - 1)
    end
    ret
end

hilbertmatrix(5)
```
That is, the $(k,j)$ entry is $1/(k+j-1)$. Its singular values decay very rapidly:
```julia
H = hilbertmatrix(100)
U, σ, V = svd(H)
scatter(σ; yscale=:log10)
```
Note numerically we typically do not get exactly zero singular values so the rank is always
treated as $\min(m,n)$. We can, however, construct a very accurate low-rank approximation:
```julia
k = 20 # rank
Σ_k = Diagonal(σ[1:k])
U_k = U[:,1:k]
V_k = V[:,1:k]
norm(U_k * Σ_k * V_k' - H)
```
Note that this can be viewed as a compression algorithm: we have replaced a matrix with $100^2 = 10{,}000$ entries by two $100 \times 20$ matrices and a vector of length $20$ ($4{,}020$ numbers in total), with minimal loss of accuracy.
We have seen that floating point arithmetic induces errors in computations, and that we can typically
bound the absolute errors to be proportional to
To justify what follows, we first observe that errors in implementing matrix-vector multiplication
can be captured by considering the multiplication to be exact on the wrong matrix: that is, A*x
(implemented with floating point) is precisely
To discuss floating point errors we need to be precise which order the operations happened.
We will use the definition mul(A,x) introduced earlier. (mul_rows actually
does the exact same operations, just in a different order.) Note that each entry of the result is in fact a dot-product
of the corresponding rows so we first consider the error in the dot product dot(𝐱,𝐲) as implemented in floating-point,
which we denote
We first need a helper proposition:
Proposition If $|\epsilon_k| \leq \epsilon$
and $n\epsilon < 1$, then
$$\prod_{k=1}^n (1 + \epsilon_k) = 1 + \theta_n$$
for some constant $\theta_n$
satisfying $|\theta_n| \leq \dfrac{n\epsilon}{1 - n\epsilon}$.
The proof is left as an exercise (Hint: use induction).
Lemma (dot product backward error) For
, where
where
means absolute-value of each entry.
Proof
Note that
where we denote the errors from multiplication as
Thus
and the theorem follows from homogeneity:
Theorem (matrix-vector backward error) For
and we have where
Therefore
Proof
The bound on
This leaves the 2-norm example, which is a bit more challenging as there are matrices
where
So now we get to a mathematical question independent of floating point: can we bound the relative error in approximating
if we know a bound on
Definition (condition number) For a square matrix $A$, the condition number (in $p$-norm) is
$$\kappa_p(A) := \|A\|_p\|A^{-1}\|_p,$$
with the $2$-norm given in terms of singular values:
$$\kappa_2(A) = \frac{\sigma_1}{\sigma_n}.$$
Theorem (relative-error for matrix-vector) The worst-case relative error in $A\mathbf{x} \approx (A + \delta A)\mathbf{x}$
is
$$\frac{\|\delta A\,\mathbf{x}\|}{\|A\mathbf{x}\|} \leq \kappa(A)\varepsilon$$
if we have the relative perturbation error $\|\delta A\| = \|A\|\varepsilon$.
Proof
We can assume
Thus we have:
Thus for floating point arithmetic we know the error is bounded by
If one uses QR to solve
YouTube Lectures: Condition Numbers Indefinite integration via Finite Differences Euler Methods Poisson Equation Convergence of Finite Differences
We now see our first application: solving differential equations. We will focus on the following differential equations:
Our approach to solving these is to
for a vector
Remark: (advanced) One should normally not need to implement these methods oneself as there
are packages available, e.g. DifferentialEquations.jl. Moreover Forward and Backward
Euler are only the first baby steps to a wide range of time-steppers, with Runge–Kutta being
one of the most successful.
For example we can solve
a simple differential equation like a pendulum
```julia
using DifferentialEquations, LinearAlgebra, Plots

u = solve(ODEProblem((u,_,x) -> [u[2], -sin(u[1])], [1,0], (0,10)))
plot(u)
```
However, even in these automated packages one has a choice of different methods with different behaviour, so it is important to understand what is happening.
In this section we consider the forward and backward Euler methods, which are based on forward and backward difference approximations to the derivative. In the problem sheet we will investigate a rule that takes the average of the two (with significant benefits). We first discuss the simplest case of indefinite integration, where central differences is also applicable, then introduce forward and backward Euler for linear scalar, linear vector, and nonlinear differential equations.
We begin with the simplest differential equation on an interval
Using the forward-difference (which is the standard finite-difference) approximation we
choose
That is, where
This condition can be recast as a linear system:
where the super-script
This is a lower-triangular bidiagonal system, so can be solved using
forward substitution in $O(n)$ operations.
We can also consider discretisation at the mid-point
That is, we have the exact same system with a different right-hand side:
And of course there is
which we leave as an exercise.
Example
Let's do an example of integrating $f(x) = \cos x$. We first construct the bidiagonal matrix:
```julia
using LinearAlgebra, Plots

function indefint(x)
    h = step(x) # x[k+1] - x[k]
    n = length(x)
    L = Bidiagonal([1; fill(1/h, n-1)], fill(-1/h, n-1), :L)
end

n = 10
x = range(0, 1; length=n)
L = indefint(x)
```
We can now solve for our particular problem using both the left and mid-point rules:
```julia
c = 0 # u(0) = 0
f = x -> cos(x)

m = (x[1:end-1] + x[2:end])/2 # midpoints

𝐟ᶠ = f.(x[1:end-1]) # evaluate f at all but the last point
𝐟ᵐ = f.(m) # evaluate f at the midpoints
𝐮ᶠ = L \ [c; 𝐟ᶠ] # integrate using forward-differences
𝐮ᵐ = L \ [c; 𝐟ᵐ] # integrate using central-differences

plot(x, sin.(x); label="sin(x)", legend=:bottomright)
scatter!(x, 𝐮ᶠ; label="forward")
scatter!(x, 𝐮ᵐ; label="mid")
```
They both are close though the mid-point version is significantly more accurate. We can estimate how fast it converges:
```julia
# Error from indefinite integration with c and f
function forward_err(u, c, f, n)
    x = range(0, 1; length = n)
    uᶠ = indefint(x) \ [c; f.(x[1:end-1])]
    norm(uᶠ - u.(x), Inf)
end

function mid_err(u, c, f, n)
    x = range(0, 1; length = n)
    m = (x[1:end-1] + x[2:end]) / 2 # midpoints
    uᵐ = indefint(x) \ [c; f.(m)]
    norm(uᵐ - u.(x), Inf)
end

ns = 10 .^ (1:8) # solve up to n = 100 million
scatter(ns, forward_err.(sin, 0, f, ns); xscale=:log10, yscale=:log10, label="forward")
scatter!(ns, mid_err.(sin, 0, f, ns); label="mid")
plot!(ns, ns .^ (-1); label="1/n")
plot!(ns, ns .^ (-2); label="1/n^2")
```
This is a log-log plot: we scale both axes logarithmically so that an algebraic rate of convergence $n^{-\alpha}$ appears as a straight line whose slope is determined by $\alpha$. Forward differences converge like $O(n^{-1})$ while central differences converge like $O(n^{-2})$.
Now consider a scalar linear time-evolution problem for
Label the
Definition (Restriction matrices) Define the
restriction matrices as
Again we can replace the discretisation using finite-differences, giving us
for
Putting everything together we have the system:
where
Here is a simple example for solving:
which has an exact solution in terms of a special error function (which we determined using Mathematica).
```julia
using SpecialFunctions
c = 1
a = t -> t
n = 2000
t = range(0, 1; length=n)
# exact solution, found in Mathematica
u = t -> -(1/2)*exp(-(1+t^2)/2)*(-2sqrt(ℯ) + sqrt(2π)erfi(1/sqrt(2)) - sqrt(2π)erfi((1 + t)/sqrt(2)))

h = step(t)
L = Bidiagonal([1; fill(1/h, n-1)], a.(t[1:end-1]) .- 1/h, :L)

norm(L \ [c; exp.(t[1:end-1])] - u.(t), Inf)
```
We see that it is converging to the true result.
Note that this is a simple forward-substitution of a bidiagonal system, so we can also just construct it directly:
Remark: (advanced) Note this can alternatively be reduced to an integral
and solved as above but this approach is harder to generalise.
In Backward Euler we replace the forward-difference with a backward-difference, that is
This leads to the system:
Again this is a bidiagonal forward-substitution:
That is,
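As a small illustration of a backward Euler step (the equation $u' = -u$, $u(0) = 1$ is chosen here for simplicity; it is not from the notes), each step solves $(1 - ha)u_{k+1} = u_k$:

```julia
n = 1000
t = range(0, 1; length=n)
h = step(t)
a = -1.0                       # u' = a*u, exact solution exp(a*t)
u = zeros(n)
u[1] = 1.0
for k = 1:n-1
    u[k+1] = u[k] / (1 - h*a)  # backward Euler: (1 - h*a)*u[k+1] = u[k]
end
u[end]                         # close to exp(-1)
```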
We can also solve systems, that is, equations of the form:
where
Forward Euler gives us:
That is, each time-step consists of matrix-vector multiplication. On the other hand Backward Euler requires inverting a matrix at each time-step:
Example (Airy equation) Consider the (negative-time) Airy equation:
We can recast it as a system by defining
which satisfies
It is natural to represent the time-slices
```julia
n = 100_000
t = range(0, 50; length=n)
A = t -> [0 1; -t 0]
h = step(t)

U = zeros(2, n) # each column is a time-slice
U[:,1] = [1.0, 0.0] # initial condition
for k = 1:n-1
    U[:,k+1] = (I + h*A(t[k]))*U[:,k]
end

plot(t, U')
```
We leave implementation of backward Euler as a simple exercise.
Example (Heat on a graph)
Those who took Introduction to Applied Mathematics will recall
heat equation on a graph. Consider a simple graph of
If we denote the heat at time
We consider the case of a periodic forcing at the middle node
Heat equation on this lattice is defined as follows:
We can employ forward and backward Euler:
```julia
n = 1_000 # number of time-steps
t = range(0, 100; length=n)
h = step(t)

m = 50 # number of nodes

Δ = SymTridiagonal([-1; fill(-2.0, m-2); -1], ones(m-1))
ω = 1
f = t -> cos(ω*t) # periodic forcing with period 1

Uᶠ = zeros(m, n) # each column is a time-slice for forward Euler
Uᵇ = zeros(m, n) # each column is a time-slice for backward Euler

Uᶠ[:,1] = Uᵇ[:,1] = zeros(m) # initial condition

for k = 1:n-1
    Uᶠ[:,k+1] = (I + h*Δ)*Uᶠ[:,k]
    Uᶠ[m÷2,k+1] += h*f(t[k]) # add forcing at the middle node
end

𝐞 = zeros(m); 𝐞[m÷2] = 1;

for k = 1:n-1
    Uᵇ[:,k+1] = (I - h*Δ)\(Uᵇ[:,k] + h*f(t[k+1])𝐞)
end

scatter(Uᶠ[:,end]; label="forward")
scatter!(Uᵇ[:,end]; label="backward")
```
Both match!
Remark: If you change the number of time-steps to be too small, for example n = 100, forward
Euler blows up while backward Euler does not. This will be discussed in the problem
sheet.
Remark: (advanced) Memory allocations are very expensive so in practice one should preallocate and use memory.
Forward-Euler extends naturally to nonlinear equations, including the vector case:
becomes:
Here we show a simple solution to a nonlinear Pendulum:
by writing
```julia
n = 1000
𝐮 = fill(zeros(2), n)
x = range(0, 20; length=n)
h = step(x) # same as x[k+1] - x[k]

𝐮[1] = [1, 0]
for k = 1:n-1
    𝐮[k+1] = 𝐮[k] + h * [𝐮[k][2], -sin(𝐮[k][1])]
end

plot(x, first.(𝐮))
```
As we see it correctly predicts the oscillatory behaviour of a pendulum, and matches the simulation using DifferentialEquations.jl above.
Here we will only consider one discretisation as it is symmetric:
That is we use the
Example (Poisson) Consider the Poisson equation with Dirichlet conditions:
which we discretise as
As a linear system this equation becomes:
Thus we solve:
```julia
x = range(0, 1; length = n)
h = step(x)
T = Tridiagonal([fill(1/h^2, n-2); 0], [1; fill(-2/h^2, n-2); 1], [0; fill(1/h^2, n-2)])
u = T \ [1; exp.(x[2:end-1]); 2]
scatter(x, u)
```
We can test convergence on $u(x) = \cos(x^2)$, which satisfies the boundary conditions $u(0) = 1$ and $u(1) = \cos 1$, and for which $u''(x) = -4x^2\cos(x^2) - 2\sin(x^2)$.
We observe uniform ($O(n^{-2})$) convergence:
```julia
function poisson_err(u, c_0, c_1, f, n)
    x = range(0, 1; length = n)
    h = step(x)
    T = Tridiagonal([fill(1/h^2, n-2); 0], [1; fill(-2/h^2, n-2); 1], [0; fill(1/h^2, n-2)])
    uᶠ = T \ [c_0; f.(x[2:end-1]); c_1]
    norm(uᶠ - u.(x), Inf)
end

u = x -> cos(x^2)
f = x -> -4x^2*cos(x^2) - 2sin(x^2)

ns = 10 .^ (1:8) # solve up to n = 100 million
scatter(ns, poisson_err.(u, 1, cos(1), f, ns); xscale=:log10, yscale=:log10, label="error")
plot!(ns, ns .^ (-2); label="1/n^2")
```
We now study convergence of the approaches for the constant-coefficient case. We will use Toeplitz matrices as a tool to simplify the explanation.
Definition (Toeplitz) A Toeplitz matrix has constant diagonals: $A[k,j] = a_{k-j}$.
Proposition (Bidiagonal Toeplitz inverse) The inverse of a $n \times n$
bidiagonal Toeplitz matrix is:
$$\begin{bmatrix} a \\ b & a \\ & \ddots & \ddots \\ & & b & a \end{bmatrix}^{-1} = \frac{1}{a}\begin{bmatrix} 1 \\ -b/a & 1 \\ \vdots & \ddots & \ddots \\ (-b/a)^{n-1} & \cdots & -b/a & 1 \end{bmatrix}.$$
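We can verify this numerically with $a = 2$, $b = -1$ (arbitrary choices), so the inverse entries should be $(-b/a)^{k-j}/a = 2^{-(k-j+1)}$ below and on the diagonal:

```julia
using LinearAlgebra

n = 5
a, b = 2.0, -1.0
L = Bidiagonal(fill(a, n), fill(b, n-1), :L)
M = inv(Matrix(L))
M[3,1]   # (-b/a)^2/a = 1/8
```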
Theorem (Forward/Backward Euler convergence) Consider the equation
Denote
Assume that
is twice-differentiable with uniformly bounded second derivative. Then the error for forward/backward Euler is
Proof We prove the error bound for forward Euler as backward Euler is similar. This proof consists of two stages: (1) consistency and (2) stability.
Consistency means our discretisation approximates the true equation, that is:
where
Stability means the inverse does not blow up the error. We need to be a bit careful and first write, for
Stability in this case is the statement that
using the fact (which likely you have seen in first year) that $(1 + c/n)^n \leq e^c$ for $c \geq 0$.
We now combine stability and consistency. We have:
For 2D problems we consider Poisson. The first stage is to row-reduce to get a symmetric tridiagonal (pos. def.) matrix:
Considering the right-hand side and dropping the first and last rows our equation becomes:
Remark: (advanced) You may recognise
Theorem (Poisson convergence) Suppose that
is four-times differentiable with uniformly bounded fourth derivative. Then the finite difference approximation to Poisson converges like .
Proof
For consistency we need the Taylor series error of second-order finite differences. We have
where
(Note this is a slightly stronger result than used in the solution of PS2.) Thus we have consistency:
where
Following PS5 we deduce that it has the Cholesky-like decomposition
We can invert:
Thus we have stability:
Putting everything together we have, for
What about the observed instability? The condition number of the matrix provides an intuition (though not a proof: condition numbers are only upper bounds!). Here we have
One can show by looking at
YouTube Lectures: Fourier Series Trapezium Rule and Fourier Coefficients Discrete Fourier Transform (DFT) Fast Fourier Transform (FFT)
In Part III, Computing with Functions, we work with approximating functions by expansions in bases: that is, instead of approximating at a grid (as in the Differential Equations chapter), we approximate functions by other, simpler, functions. The most fundamental basis is (complex) Fourier series:
where
In numerical analysis we try to build on the analogy with linear algebra as much as possible. Therefore we write this as:
Where the underline indicates the zero-index location.
More precisely, we are going to build an approximation using
This can be thought of as an approximate Taylor expansion using the change-of-variables
In analysis one typically works with continuous functions and relates results to continuity. In numerical analysis we inherently have to work with vectors, so it is more natural to focus on the case where the Fourier coefficients
We first state a basic results (whose proof is beyond the scope of this module):
Theorem (convergence) If the Fourier coefficients are absolutely convergent then
$$f(\theta) = \sum_{k=-\infty}^\infty \hat{f}_k e^{ik\theta},$$
which converges uniformly.
Remark: (advanced) We also have convergence for the continuous version
of the
for any function such that
Fortunately, smoothness gives us sufficient (though not necessary) conditions for absolute convergence:
Proposition (differentiability and absolute convergence) If
and are periodic and is uniformly bounded, then its Fourier coefficients satisfy
Proof
Integrate by parts twice using the fact that
thus uniform boundedness of
using the comparison test.
This condition can be weakened to Lipschitz continuity but the proof is beyond the scope
of this module.
Of more practical importance is the other direction: the more times differentiable a function is, the faster its coefficients decay, and hence the faster its Fourier series converges.
In fact, if a function is smooth and 2π-periodic its Fourier coefficients decay
faster than algebraically: they decay like
Remark: (advanced) Going further, if we let
Let
But if
Define the Trapezium rule approximation to the Fourier coefficients by:
Lemma (Discrete orthogonality) We have:
In other words,
Proof
Consider
(Case 1:
(Case 2:
Then we have
where we use the fact that
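We can verify discrete orthogonality numerically; a small sketch (with n = 6 chosen arbitrarily):

```julia
# Discrete orthogonality: (1/n) Σ_j exp(i*k*θ_j) equals 1 when n divides k
# and 0 otherwise, for the evenly spaced grid θ_j = 2πj/n.
n = 6
θ = range(0, 2π; length=n+1)[1:end-1]
s(k) = sum(exp.(im*k*θ))/n

s(0), s(n), s(1), s(n-1)  # first two ≈ 1, last two ≈ 0
```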
Theorem (discrete Fourier coefficients) If
is absolutely convergent then
Proof
Note that there is redundancy:
Corollary (aliasing) For all
, .
In other words if we know
where
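Aliasing is easy to observe numerically. The following sketch (with an arbitrarily chosen function) checks that shifting the index by n leaves the trapezium-rule coefficient unchanged:

```julia
# Aliasing: trapezium-rule approximations of Fourier coefficients repeat with
# period n, since exp(-i*(k+n)*θ_j) == exp(-i*k*θ_j) on the evenly spaced grid.
f = θ -> exp(cos(θ))
n = 8
θ = range(0, 2π; length=n+1)[1:end-1]
f̂ₙ(k) = sum(exp.(-im*k*θ) .* f.(θ))/n

f̂ₙ(2) ≈ f̂ₙ(2+n) ≈ f̂ₙ(2-n)
```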
We first discuss the case when all negative coefficients are zero,
noting that the Fourier series is in fact a Taylor series if we let
That is,
We can prove convergence of this approximation whenever
Theorem (Taylor series converges) If
and is absolutely convergent then converges uniformly to
.
Proof
which goes to zero as
For the general case we need to choose a range of coefficients that includes roughly an equal number of negative and positive coefficients (preferring negative over positive in a tie as a convention):
In the problem sheet we will prove this converges provided the coefficients are absolutely convergent.
We note that the map from values to coefficients can be defined as a matrix-vector product using the DFT:
Definition (DFT) The Discrete Fourier Transform (DFT) is defined as:
for the
-th root of unity . Note that
That is, we have
The choice of normalisation constant is motivated by the following:
Proposition (DFT is Unitary)
is unitary: .
Proof
In other words,
Corollary (Interpolation)
interpolates at :
Proof We have
We will leave extending this result to the problem sheet. Note that regardless of choice of coefficients we interpolate, though some interpolations are better than others:
```julia
using Plots, LinearAlgebra

## evaluates f_n at a point
function finitefourier(𝐟̂ₙ, θ)
    m = n ÷ 2 # use coefficients between -m:m
    ret = 0.0 + 0.0im # coefficients are complex so we need complex arithmetic
    for k = 0:m
        ret += 𝐟̂ₙ[k+1] * exp(im*k*θ)
    end
    for k = -m:-1
        ret += 𝐟̂ₙ[end+k+1] * exp(im*k*θ)
    end
    ret
end

function finitetaylor(𝐟̂ₙ, θ)
    ret = 0.0 + 0.0im # coefficients are complex so we need complex arithmetic
    for k = 0:n-1
        ret += 𝐟̂ₙ[k+1] * exp(im*k*θ)
    end
    ret
end

f = θ -> exp(cos(θ))
n = 7
θ = range(0, 2π; length=n+1)[1:end-1] # θ_0, …, θ_{n-1}, dropping θ_n == 2π
Qₙ = 1/sqrt(n) * [exp(-im*(k-1)*θ[j]) for k = 1:n, j = 1:n]
𝐟̂ₙ = 1/sqrt(n) * Qₙ * f.(θ)

fₙ = θ -> finitefourier(𝐟̂ₙ, θ)
tₙ = θ -> finitetaylor(𝐟̂ₙ, θ)

g = range(0, 2π; length=1000) # plotting grid
plot(g, f.(g); label="function", legend=:bottomright)
plot!(g, real.(fₙ.(g)); label="Fourier")
plot!(g, real.(tₙ.(g)); label="Taylor")
scatter!(θ, f.(θ); label="samples")
```

We now demonstrate the relationship of Taylor and Fourier coefficients and their discrete approximations for some examples:
Example Consider the function
Under the change of variables
i.e.,
If we use an
We can verify this numerically:
```julia
f = θ -> 2/(2 - exp(im*θ))
n = 7
θ = range(0, 2π; length=n+1)[1:end-1] # θ_0, …, θ_{n-1}, dropping θ_n == 2π
Qₙ = 1/sqrt(n) * [exp(-im*(k-1)*θ[j]) for k = 1:n, j = 1:n]

Qₙ/sqrt(n)*f.(θ) ≈ 2 .^ (n .- (0:n-1)) / (2^n-1)
```

Applying
The key observation is that hidden in
where
That is,
```julia
n = 4
σ = [1:2:2n-1; 2:2:2n]
P_σ = I(2n)[σ,:]
```

and so
In other words, we reduced the DFT to two DFTs applied to vectors of half the dimension.
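This divide-and-conquer idea can be turned into a recursive implementation. Below is a minimal sketch (`myfft` is our own illustrative name, assuming the length is a power of 2, and computing the unnormalised DFT rather than the unitary version used above):

```julia
# Radix-2 Cooley–Tukey FFT sketch: split into even- and odd-indexed entries,
# recurse, and recombine with "twiddle factors". Computes the unnormalised
# DFT: X_k = Σ_j x_j exp(-2πi*j*k/n), for j = 0, …, n-1.
function myfft(x)
    n = length(x)
    n == 1 && return complex(x)
    even = myfft(x[1:2:end])               # DFT of even-indexed entries (0-based)
    odd  = myfft(x[2:2:end])               # DFT of odd-indexed entries
    w = [exp(-2im*π*k/n) for k = 0:n÷2-1]  # twiddle factors
    [even .+ w .* odd; even .- w .* odd]
end

x = randn(8)
# compare against the direct O(n²) DFT sum
dft = [sum(x[j+1]*exp(-2im*π*j*k/8) for j = 0:7) for k = 0:7]
myfft(x) ≈ dft
```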
We can see this formula in code:
```julia
function fftmatrix(n)
    θ = range(0, 2π; length=n+1)[1:end-1] # θ_0, …, θ_{n-1}, dropping θ_n == 2π
    [exp(-im*(k-1)*θ[j]) for k = 1:n, j = 1:n]/sqrt(n)
end

Q₂ₙ = fftmatrix(2n)
Qₙ = fftmatrix(n)
Dₙ = Diagonal([exp(im*k*π/n) for k = 0:n-1])
(P_σ'*[Qₙ' Qₙ'; Qₙ'*Dₙ -Qₙ'*Dₙ])[1:n,1:n] ≈ sqrt(2)Q₂ₙ'[1:n,1:n]
```

Now assume
Remark: The FFTW.jl package wraps the FFTW (Fastest Fourier Transform in the West) library,
which is a highly optimised implementation
of the FFT that works well even when n is not a power of 2.
```julia
using FFTW
f = θ -> exp(cos(θ-0.1))
n = 31
m = n÷2
## evenly spaced points from 0:2π, dropping last node
θ = range(0, 2π; length=n+1)[1:end-1]

## fft returns discrete Fourier coefficients n*[f̂ⁿ_0, …, f̂ⁿ_(n-1)]
fc = fft(f.(θ))/n

## We reorder using [f̂ⁿ_(-m), …, f̂ⁿ_(-1)] == [f̂ⁿ_(n-m), …, f̂ⁿ_(n-1)]
## == [f̂ⁿ_(m+1), …, f̂ⁿ_(n-1)]
f̂ = [fc[m+2:end]; fc[1:m+1]]

## equivalent to f̂ⁿ_(-m)*exp(-im*m*θ) + … + f̂ⁿ_(m)*exp(im*m*θ)
fₙ = θ -> transpose([exp(im*k*θ) for k=-m:m]) * f̂

## plotting grid
g = range(0, 2π; length=1000)
plot(abs.(fₙ.(g) - f.(g)))
```

Thus we have successfully approximated the function to roughly machine precision.
The magic of the FFT is that it computes the DFT in only O(n log n) operations, rather than the O(n²) of naive matrix-vector multiplication, so we can use very large n:
```julia
f = θ -> exp(sin(θ))/(1 + 1e6cos(θ)^2)
n = 100_001
m = n÷2
## evenly spaced points from 0:2π, dropping last node
θ = range(0, 2π; length=n+1)[1:end-1]

## fft returns discrete Fourier coefficients n*[f̂ⁿ_0, …, f̂ⁿ_(n-1)]
fc = fft(f.(θ))/n

## We reorder using [f̂ⁿ_(-m), …, f̂ⁿ_(-1)] == [f̂ⁿ_(n-m), …, f̂ⁿ_(n-1)]
f̂ = [fc[m+2:end]; fc[1:m+1]]

plot(abs.(fc); yscale=:log10, legend=:bottomright, label="default")
plot!(abs.(f̂); yscale=:log10, label="reordered")
```

We can use the FFT to compute some mathematical objects efficiently. Here is a simple example.
Example Define the following infinite sum (which has no name apparently, according to Mathematica):
We can use the FFT to compute
where we know the Fourier coefficients from the Taylor series of
Thus we have
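As a further illustration of the same technique (using f(z) = exp(z), an example of our choosing rather than the sum above), we can recover Taylor coefficients of an analytic function from samples on the unit circle with a single FFT:

```julia
using FFTW

# Recover Taylor coefficients of f(z) = exp(z), whose k-th coefficient is 1/k!,
# by sampling on the unit circle z = exp(iθ) and applying the FFT.
n = 16
θ = range(0, 2π; length=n+1)[1:end-1]
c = fft(exp.(exp.(im*θ)))/n   # c[k+1] ≈ 1/k!, up to a tiny aliasing error

abs(c[1] - 1), abs(c[4] - 1/6)  # both tiny
```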
YouTube Lectures: Orthogonal Polynomials Jacobi Matrices Classical Orthogonal Polynomials
Fourier series proved very powerful for approximating periodic functions. If periodicity is lost, however, uniform convergence is lost. In this chapter we introduce alternative bases, orthogonal polynomials (OPs), which are built on polynomials and are applicable in the non-periodic setting. That is, we consider expansions of the form
where
Why not use monomials as in Taylor series? Hidden in the previous lecture was that we could effectively
compute Taylor coefficients by evaluating on the unit circle in the complex plane, but only if the radius of convergence
was 1. Many functions are smooth on, say,
While orthogonal polynomials span the same space as monomials, and therefore we can in theory write an approximation in monomials, orthogonal polynomials are much more stable.
In addition to numerics, OPs play a very important role in many mathematical areas including functional analysis, integrable systems, singular integral equations, complex analysis, and random matrix theory.
Definition (graded polynomial basis) A set of polynomials
is graded if is precisely degree : i.e., for
.
Note that if
Definition (orthogonal polynomials) Given an (integrable) weight
for , which defines a continuous inner product, a graded polynomial basis
are orthogonal polynomials (OPs) if whenever
.
Note in the above
Definition (orthonormal polynomials) A set of orthogonal polynomials
are orthonormal if .
Definition (monic orthogonal polynomials) A set of orthogonal polynomials
are monic if .
Proposition (expansion) If
is a degree polynomial, are orthogonal and are orthonormal then
Proof
Because
for constants
Corollary (zero inner product) If a degree
polynomial satisfies then
.
OPs are uniquely defined (up to a constant) by the property that they are orthogonal to all lower degree polynomials.
Proposition (orthogonal to lower degree) A polynomial
of precisely degree satisfies for all degree
polynomials if and only if . Therefore an orthogonal polynomial is uniquely defined by .
Proof
As
Thus by linearity of inner products we have
Now for
consider
Thus it is zero, i.e.,
A consequence of this is that orthonormal polynomials are always a constant multiple of orthogonal polynomials.
The most fundamental property of orthogonal polynomials is their three-term recurrence.
Theorem (3-term recurrence, 2nd form) If
are OPs then there exist real constants such that Proof The
case is immediate since are a basis of degree 1 polynomials. The case follows from for
as is of degree .
Note that
since
Clearly if
Corollary (monic 3-term recurrence) If
are monic then .
Example What are the monic OPs
We know from the 3-term recurrence that
where
Thus
From
we have
Finally, from
we have
The three-term recurrence can also be interpreted as a matrix known as the Jacobi matrix:
Corollary (Jacobi matrix) For
then we have
More generally, for any polynomial
we have
For the special cases of orthonormal and monic polynomials we have extra structure:
Corollary (orthonormal 3-term recurrence) If
are orthonormal then its recurrence coefficients satisfy . That is, the Jacobi matrix is symmetric:
Proof
Remark Typically the Jacobi matrix is the transpose
Remark (advanced) If you are worried about multiplication of infinite matrices/vectors note it is well-defined by the standard definition because it is banded. It can also be defined in terms of functional analysis where one considers these as linear operators (functions of functions) between vector spaces.
Remark (advanced) Every integrable weight generates a family of orthonormal polynomials, which in turn generates a symmetric Jacobi matrix.
There is a "Spectral Theorem for Jacobi matrices" that says one can go the other way: every tridiagonal symmetric matrix with bounded entries is a Jacobi matrix for some integrable weight with compact support. This is an example of what Barry Simon calls a "gem of spectral theory".
Example (uniform weight Jacobi matrix) Consider the
monic orthogonal polynomials
We can compute the orthonormal polynomials, using
as:
which have the Jacobi matrix
which is indeed symmetric. The problem sheet explores a more elegant way of doing this.
Example (expansion) Consider expanding a low degree polynomial like
Thus we have:
We multiply (using that
Classical orthogonal polynomials are special families of orthogonal polynomials with a number of beautiful properties, for example
As stated above orthogonal polynomials are uniquely defined by the weight
Other important families discussed are
Definition (Chebyshev polynomials, 1st kind)
are orthogonal with respect to and satisfy:
Definition (Chebyshev polynomials, 2nd kind)
are orthogonal with respect to .
Theorem (Chebyshev T are cos)
In other words
Proof
We need to show that
Property (2) follows under a change of variables:
if
To see that they are graded we use the fact that
In other words
which completes the proof.
Buried in the proof is the 3-term recurrence:
Corollary
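We can sanity-check the recurrence numerically using the trigonometric definition (a small sketch, with the evaluation points chosen arbitrarily):

```julia
# Verify the Chebyshev recurrence T_{k+1}(x) = 2x T_k(x) - T_{k-1}(x)
# using the trigonometric definition T_k(x) = cos(k*acos(x)).
T(k, x) = cos(k*acos(x))

x = 0.3
for k = 1:5
    @assert T(k+1, x) ≈ 2x*T(k, x) - T(k-1, x)
end
```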
In the problem sheet you will show the following:
Theorem (Chebyshev U are sin) For
, which satisfy:
Definition (Pochhammer symbol) The Pochhammer symbol is
Definition (Legendre) Legendre polynomials
are orthogonal polynomials with respect to on , with
The reason for this complicated normalisation constant is partly historical and partly that it leads to simpler formulae for the recurrence relationships.
Classical orthogonal polynomials have Rodrigues formulae, defining orthogonal polynomials as high-order derivatives of simple functions. In this case we have:
Theorem (Legendre Rodrigues)
Proof We need to verify:
(1) follows since it's a degree
(3) follows since:
which satisfies:
Theorem (Legendre 3-term recurrence)
TBA
Consider integration
For periodic integration we approximated (using the Trapezium rule) an integral by a sum. We can think of it as a weighted sum:
where
as seen by the formula
In this section we consider other quadrature rules
We want to choose
The key to property (1) is to use roots (zeros) of
YouTube Lectures: Polynomial Interpolation Orthogonal Polynomial Roots Interpolatory Quadrature Rules Gaussian Quadrature
Polynomial interpolation is the process of finding a polynomial that equals data at a precise set of points. Quadrature is the act of approximating an integral by a weighted sum:
In these notes we see that the two concepts are intrinsically linked: interpolation leads naturally
to quadrature rules. Moreover, by using a set of points
We already saw a special case of polynomial interpolation, where we saw that the polynomial
equaled
Definition (interpolatory polynomial) Given
distinct points and samples , a degree interpolatory polynomial satisfies
The easiest way to solve this problem is to invert the Vandermonde system:
Definition (Vandermonde) The Vandermonde matrix associated with
distinct points is the matrix
Proposition (interpolatory polynomial uniqueness) The interpolatory polynomial is unique, and the Vandermonde matrix is invertible.
Proof
Suppose
For the second part, if
hence
∎
Thus a quick-and-dirty way to do interpolation is to invert the Vandermonde matrix (which we saw in the least-squares setting with more samples than coefficients):
```julia
using Plots, LinearAlgebra
f = x -> cos(10x)
n = 5

x = range(0, 1; length=n) # evenly spaced points (BAD for interpolation)
V = x .^ (0:n-1)' # Vandermonde matrix
c = V \ f.(x) # coefficients of interpolatory polynomial
p = x -> dot(c, x .^ (0:n-1))

g = range(0, 1; length=1000) # plotting grid
plot(g, f.(g); label="function")
plot!(g, p.(g); label="interpolation")
scatter!(x, f.(x); label="samples")
```

But it turns out we can also construct the interpolatory polynomial directly.
We will use the following basis polynomials, which equal 1 at the grid point they correspond to and 0 at the other grid points:
Definition (Lagrange basis polynomial) The Lagrange basis polynomial is defined as
Plugging in the grid points verifies the following:
Proposition (delta interpolation)
We can use these to construct the interpolatory polynomial:
Theorem (Lagrange interpolation) The unique polynomial of degree at most
that interpolates at is:
Proof Note that
so we just need to show it is unique. Suppose
∎
Example We can interpolate
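The Lagrange construction is short to implement directly. A minimal sketch (the names `lagrangebasis` and `interp` are ours):

```julia
# ℓ_k(x) = ∏_{j ≠ k} (x - x_j)/(x_k - x_j): equals 1 at x_k, 0 at the other points
lagrangebasis(x, 𝐱, k) = prod((x - 𝐱[j])/(𝐱[k] - 𝐱[j]) for j in eachindex(𝐱) if j ≠ k)

# interpolatory polynomial p(x) = Σ_k f_k ℓ_k(x)
interp(x, 𝐱, 𝐟) = sum(𝐟[k]*lagrangebasis(x, 𝐱, k) for k in eachindex(𝐱))

𝐱 = [0.0, 0.5, 1.0]
f = x -> cos(10x)
interp(0.5, 𝐱, f.(𝐱)) ≈ f(0.5)  # the interpolant reproduces the samples
```

Since the interpolatory polynomial is unique, `interp` also reproduces any polynomial of degree at most n−1 exactly.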
Remark Interpolating at evenly spaced points is a really bad idea: interpolation is inherently ill-conditioned. The problem sheet asks you to explore this experimentally.
We now consider roots (zeros) of orthogonal polynomials
Lemma
has exactly distinct roots.
Proof
Suppose
for
does not change sign. In other words:
This is only possible if
∎
Definition (truncated Jacobi matrix) Given a symmetric Jacobi matrix
, (or the weight whose orthonormal polynomials are associated with ) the truncated Jacobi matrix is
Lemma (zeros) The zeros
of are the eigenvalues of the truncated Jacobi matrix . More precisely, for the orthogonal matrix
where
.
Proof
We construct the eigenvector (noting
The result follows from normalising the eigenvectors. Since
∎
Example (Chebyshev roots) Consider
Consider the
We also have from the 3-term recurrence:
We orthonormalise by rescaling
so that the Jacobi matrix is symmetric:
We can then confirm that we have constructed an eigenvector/eigenvalue of the
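For instance (a numerical sketch, assuming the standard orthonormalised Chebyshev Jacobi matrix with zero diagonal and off-diagonal 1/√2, 1/2, 1/2, …), the eigenvalues of the truncated Jacobi matrix reproduce the Chebyshev roots cos((j−1/2)π/n):

```julia
using LinearAlgebra

# Truncated Jacobi matrix for orthonormalised Chebyshev T polynomials
n = 5
Xₙ = SymTridiagonal(zeros(n), [1/sqrt(2); fill(1/2, n-2)])

chebroots = [cos((j-1/2)*π/n) for j = 1:n]  # roots of T_n
sort(eigvals(Xₙ)) ≈ sort(chebroots)
```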
Definition (interpolatory quadrature rule) Given a set of points
the interpolatory quadrature rule is: where
Proposition (interpolatory quadrature is exact for polynomials) Interpolatory quadrature is exact for all degree
polynomials :
Proof The result follows since, by uniqueness of interpolatory polynomial:
∎
Example (arbitrary points) Find the interpolatory quadrature rule for
That is we have
This is indeed exact for polynomials up to degree
Example (Chebyshev roots) Find the interpolatory quadrature rule for
Recall from before that
(It's not a coincidence that they are all the same but this will differ for roots of other OPs.) That is we have
This is indeed exact for polynomials up to degree
We shall explain this miracle next.
Gaussian quadrature is the interpolatory quadrature rule corresponding
to the grid
Definition (Gauss quadrature) Given a weight
, the Gauss quadrature rule is: where
are the roots of and Equivalently,
are the eigenvalues of and (Note we have
.)
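This definition translates directly to code via the eigendecomposition of the truncated Jacobi matrix (the Golub–Welsch approach). A sketch for the Chebyshev weight 1/√(1−x²) (an example of our choosing), whose Gauss quadrature weights all equal π/n:

```julia
using LinearAlgebra

# Gauss quadrature from the truncated Jacobi matrix: nodes are its eigenvalues,
# weights are (∫ w dx) times the squared first components of the eigenvectors.
n = 5
Xₙ = SymTridiagonal(zeros(n), [1/sqrt(2); fill(1/2, n-2)])  # Chebyshev T
λ, Q = eigen(Xₙ)
w = π * Q[1,:].^2   # ∫_{-1}^1 dx/√(1-x²) = π

sum(w .* λ.^2) ≈ π/2  # matches ∫_{-1}^1 x²/√(1-x²) dx, exact since 2 ≤ 2n-1
```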
In analogy to how Fourier series are orthogonal with respect to the Trapezium rule, orthogonal polynomials are orthogonal with respect to Gaussian quadrature:
Lemma (Discrete orthogonality) For
,
Proof
∎
Just as approximating Fourier coefficients using Trapezium rule gives a way of interpolating at the grid, so does Gaussian quadrature:
Theorem (interpolation via quadrature)
interpolates
at the Gaussian quadrature points .
Proof
Consider the Vandermonde-like matrix:
and define
so that
Note that if
But we see that (similar to the Fourier case)
∎
Corollary Gaussian quadrature is an interpolatory quadrature rule with the interpolation points equal to the roots of
:
Proof We want to show it's the same as integrating the interpolatory polynomial:
∎
A consequence of being an interpolatory quadrature rule is that it is exact for all
polynomials of degree
Theorem (Exactness of Gauss quadrature) If
is a degree polynomial then Gauss quadrature is exact:
Proof Using polynomial division algorithm (e.g. by matching terms) we can write
where
∎
Example (Chebyshev expansions)
Consider the construction of Gaussian quadrature for
We can check each case and deduce that
We can use this to expand a polynomial, e.g.
In other words:
which can be easily confirmed.